Preface



Modern database systems enhance the capabilities of traditional database systems by 
their ability to handle any kind of data, including text, image, audio, and video. Today, 
database systems are particularly relevant to the Web, as they can provide input to content 
generators for Web pages, and can handle queries issued over the Internet. 

The Extensible Markup Language (XML) is used in applications running the gamut
from content management through publishing to Web services and e-commerce. It is 
used as the universal communication language for exchanging music and graphics as 
well as purchase orders and technical documentation. 

As database systems increasingly talk to each other over the Web, there is a
fast-growing desire to use XML as the standard exchange format. As a result, many
relational database systems can export data as XML documents, import data from
XML documents, and provide query and update capabilities for XML data. In addition,
so-called native XML database and integration systems are appearing on the database
market; they claim to be especially tailored to storing, maintaining, and easily
accessing XML documents.

After the huge success of the first XML Database Symposium (XSym 2003) last 
year in Berlin (already then in conjunction with VLDB) it was decided to establish this 
symposium as an annual event that is supposed to take place as an integral part of VLDB. 
The goal of this symposium is to provide a high-quality platform for the presentation and 
discussion of new research results and system developments. It is targeted at scientists, 
practitioners, vendors and users of XML and database technologies. 

The call for papers attracted about 60 submissions from all over the world. After a
careful reviewing process, the international program committee accepted 15 papers
of particular relevance and quality. The selected contributions cover a wide range
of topics, in particular XQuery processing; searching, ranking, and mapping
XML documents; XML constraints checking and correcting; and XML processing. A
highlight of the symposium was the keynote by Mary Fernandez from AT&T
Research. Her talk “Building an Extensible XQuery Engine: Experiences with Galax”
was a perfect start for this symposium.

As the editors of this volume, we would like to thank all the program committee 
members and external reviewers who sacrificed their valuable time to review the papers 
and helped in putting together a truly convincing program. 

We would also like to thank the organizers of VLDB 2004 (and of XSym), especially 
Iluju Kiringa, and the general chair of VLDB 2004, John Mylopoulos, without whom 
this symposium would not have been possible. Our special thanks also go to Akmal 
Chaudhri. He not only was always available when a helping hand was needed but he also 
did an excellent job with implementing and maintaining the XSym homepage. Finally, 
we would like to thank Alfred Hofmann from Springer for his friendly cooperation and
help in putting this volume together. 
Building an Extensible XQuery Engine:
Experiences with Galax 

(Extended Abstract) 
mff@research.att.com

^ IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA 
Abstract. XQuery 1.0 and its sister language XPath 2.0 have set a fire
underneath database vendors and researchers alike. More than thirty 
commercial and research XQuery implementations are listed on the XML 
Query working group home page (http://www.w3.org/XML/Query). Most 
of these implementations are targeted to particular storage systems or 
application domains. 

Galax (http://www.galaxquery.org) is an open-source, general-purpose
XQuery engine, designed for maximal extensibility. In this talk, we will 
discuss Galax’s extensibility features and the design trade-offs that we 
continuously face between extensibility and performance. 



For the past four years, we have been actively involved in defining XQuery
1.0 [10], a query language for XML designed to meet the diverse needs of
applications that query and exchange XML. XQuery 1.0 and its sister language
XPath 2.0 are designed jointly by members of the World Wide Web Consortium’s
XSLT and XML Query working groups, which include constituencies such as
large database vendors, small middleware start-ups, “power” XML-user
communities, and industrial research labs. Each constituency has produced several
XQuery implementations, which is unprecedented for a language that is (still!)
not standardized. A current listing of XQuery implementations is on the XML
Query working group home page (http://www.w3.org/XML/Query).

Not surprisingly, each constituency has implemented XQuery with a partic- 
ular application domain, user base, or existing database architecture in mind. 
The large database vendors are concerned with implementing XQuery efficiently 
and compatibly within their relational database architectures. Draper [1] sur- 
veys the dual problems of storing XML data in relational storage systems and 
of publishing relational data in XML, when the relational database is entirely 
agnostic to XML data. In the same text, Rys [8] considers extended relational 
architectures that integrate XML data into the underlying data model and the 
techniques for compiling XQuery to SQL. The solutions described in these sur- 
veys are biased towards XML applications that can benefit from the features 
of mature relational database technologies: storage and indexing of large data 
sources; transactional data access for updates; and high-performance relational
query engines. 

Other XQuery implementations focus, for example, on the requirements of 
XML messaging applications, in which XML data is typically accessed in a 
streaming fashion. The BEA/XQRL streaming XQuery processor [5] is the first 
complete implementation of XQuery that supports streaming access to XML 
data. All input and output XML data is represented as a stream of tokens. The 
content of a token is similar to the content of a SAX event, but is enhanced with 
types identified during XML Schema validation. In the XQRL query engine, the 
token streams are accessed by pulling, not pushing, data. Every operator con- 
sumes and produces streams of tokens, which permits efficient discarding of data 
not required by the operator. These two examples of XQuery implementations 
represent two extrema: The relational engines are tailored to large, persistent, 
and mutable data sources and the streaming engines are tailored to small, tran- 
sient, and immutable data sources. 
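
The pull-based token-stream model described above can be illustrated with a small Python sketch. It is not the BEA/XQRL design: the token layout (`kind`, `name`, `value`), the operator names `select` and `text_values`, and the use of generators are assumptions made purely for illustration. The point it shows is that each operator pulls tokens from its input on demand and can drop tokens it does not need as soon as it has seen them.

```python
from typing import Iterator, NamedTuple, Optional

class Token(NamedTuple):
    # Hypothetical token layout; real engines also attach schema types.
    kind: str                      # "start", "end" or "text"
    name: Optional[str] = None
    value: Optional[str] = None

def tokens() -> Iterator[Token]:
    """A hypothetical source operator that produces its tokens lazily."""
    yield Token("start", "order")
    yield Token("start", "item")
    yield Token("text", value="pen")
    yield Token("end", "item")
    yield Token("start", "note")
    yield Token("text", value="rush")
    yield Token("end", "note")
    yield Token("start", "item")
    yield Token("text", value="ink")
    yield Token("end", "item")
    yield Token("end", "order")

def select(stream: Iterator[Token], name: str) -> Iterator[Token]:
    """Pass on only the subtrees rooted at elements called `name`.

    Tokens outside such subtrees are discarded immediately after being pulled.
    """
    depth = 0
    for tok in stream:
        if depth == 0:
            if tok.kind == "start" and tok.name == name:
                depth = 1
                yield tok
        else:
            if tok.kind == "start":
                depth += 1
            elif tok.kind == "end":
                depth -= 1
            yield tok

def text_values(stream: Iterator[Token]) -> Iterator[str]:
    """Consume a token stream and produce the values of its text tokens."""
    for tok in stream:
        if tok.kind == "text":
            yield tok.value

# The consumer drives the pipeline: nothing is read until list() pulls.
result = list(text_values(select(tokens(), "item")))
```

Because every stage is a generator, no stage materializes the whole document; the `note` subtree is dropped token by token and never buffered.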

Unlike other XQuery implementations, our implementation, called Galax [6],
was not designed with a particular application domain or existing database archi- 
tecture in mind. Instead, Galax began as a platform for validating the XQuery 
formal semantics [11]. Validating the language definition required implement- 
ing the complete XQuery data model and all phases of the XQuery processing 
model.¹ Because completeness was a requirement, Galax was designed from the
top down, and the resulting architecture, depicted in Figure 1, closely follows 
the XQuery processing model. The processing model consists of three inter- 
dependent phases: (a) XML Schema processing, (b) XML document processing, 
and (c) XQuery processing. For expediency, we chose the simplest possible imple- 
mentation strategy for each phase and focused on designing semantically simple 
and transparent representations of queries, documents, and types that are shared 
by phases. An important outcome of this strategy is that we were able to apply 
Galax to production applications that consume and produce large XML doc- 
uments and that have non-trivial query requirements [9]. In the talk, we will 
describe the processing model and related implementation issues in detail. 

While we worked on getting Galax up and running, myriad research papers on 
evaluation strategies for XPath and XQuery were popping up. We soon realized 
that Galax’s contribution might not be solutions to the numerous open problems 
of evaluating XQuery, but instead, Galax could provide a well-designed, open- 
source environment within which researchers (ourselves included) could explore 
evaluation strategies, document storage and indexing, interactions between opti- 
mizations and more. In such an environment, extensibility is paramount, so our 
recent work focuses on making Galax extensible in almost every phase. 

Currently, Galax provides extensibility features for the phases highlighted in
Figure 1. Galax supports user-defined, built-in functions, which can be dynam- 
ically loaded into the Galax query engine. We have used this feature to provide 
built-in support for sending and receiving SOAP messages. The logical rewrit- 
ing phase is implemented using a generic tree-walking algorithm, into which new 

¹ The first IPSI XQuery processor [2] was also designed for this purpose.
rewriting rules can be added declaratively. We have implemented a technique for
detecting redundant sorting and duplicate-elimination operators [4] within this 
framework. Lastly, Galax’s abstract data model is implemented for three physical 
representations of documents: a main-memory tree representation, a secondary 
storage system that uses Berkeley DB and that is used in a production sys- 
tem [9], and a streaming file representation [7]. All three representations can be 
accessed simultaneously within a single query, and adding other representations 
is straightforward. 

Interestingly, Galax’s architecture has evolved to include features from both 
the classes of implementations that we first described. Like relational engines, 
Galax now includes a secondary storage system that provides indexes and we 
are adding physical operators to utilize the storage system. And like streaming 
engines, Galax includes operators that consume and produce streams of typed 
XML tokens. 

As Galax matures, balancing the requirements of extensibility and versatility 
with performance raises numerous issues. Our near-term future work focuses 
on implementing an algebraic query representation, whose efficient evaluation 
depends on the capabilities of the data representations underlying the abstract 
data model. In the talk, we will describe the design tensions that we face between 
extensibility and performance. 
Abstract. We give a light-weight but formal introduction to XQuery by
defining a sublanguage of XQuery. We ignore typing, and don’t consider
namespaces, comments, processing instructions, and entities. To avoid
confusion we call our version LiXQuery (Light XQuery). LiXQuery is 
fully downwards compatible with XQuery. Its syntax and its semantics 
are far less complex than that of XQuery, but the typical expressions of 
XQuery are included in LiXQuery. We claim that LiXQuery is an elegant 
and simple sublanguage of XQuery that can be used for educational and 
research purposes. We give the complete syntax and the formal semantics 
of LiXQuery. 



1 Introduction 

XQuery is expected to become the standard query language for XML
documents [1,10,7,9]. However, this language is rather complex and its semantics,
although well defined (see [3,2,4]), is not easily defined in a precise and concise
manner. There seems therefore to be a need for a sublanguage of XQuery that
has almost the same expressive power as XQuery and that has an elegant syntax
and semantics that can be written down in a few pages. Similar proposals were
made for XPath 1.0 [12] and XSLT [6], and have subsequently played important
roles in practical and theoretical research [8,11].

Such a language would enable us to investigate more easily certain aspects 
of XQuery such as the expressive power of certain types of expressions found in 
XQuery, the expressive power of recursion in XQuery and possible syntactical 
restrictions that let us control this power, the complexity of deciding equivalence 
of expressions for purposes such as query optimization, the functional character 
of XQuery in comparison with functional languages such as LISP and ML, the 
role of XPath 1.0 and 2.0 in XQuery in terms of expressive power and query op- 
timization, and finally the relationship between XQuery queries and the classical 
well-understood concept of generic database queries. 

The contribution of this paper is the definition of LiXQuery, a sublanguage of 
XQuery with a relatively simple syntax and semantics that is appropriate both 
for educational and research purposes. Indeed, we are convinced that LiXQuery 
has a number of interesting properties, that can be proved formally, and that 
can be transposed to XQuery. 

* Roel Vercammen is supported by IWT - Institute for the Encouragement of Inno- 
vation by Science and Technology Flanders, grant number 31581. 
Section 2 explains the design choices we made, while Section 3 contains an example
of a query in LiXQuery. In Section 4 we give the complete syntax and
its informal meaning. Section 5 further illustrates LiXQuery with more exam- 
ples and finally in Section 6 the formal semantics of LiXQuery is given. In the 
remainder of this paper the reader is assumed to be acquainted with XML and 
to have some notions about XQuery. 



2 Design Choices 



We designed LiXQuery with two audiences in mind. First of all, we target
researchers investigating the expressive power and the computational complexity
of XQuery. From experience, we know that the syntax and semantics in the
XQuery standard are unwieldy for proving certain properties; hence we dropped
a number of language features which are important for practical purposes but
not essential from a theory perspective. The second audience for LiXQuery is
teachers. Here as well we learned from experience that the XQuery standard
contains features which are important for designing an efficient and practical
language, but not essential to understand the typical queries written in XQuery.

Therefore, we chose to omit a number of standard XQuery features. However,
to ensure the validity of LiXQuery, we designed it as a proper sublanguage.
Specifically, we specified LiXQuery so that all syntactically valid LiXQuery ex- 
pressions do also satisfy the XQuery syntax. Moreover, the LiXQuery semantics 
is defined in such a way that the result of a query evaluated using our semantics 
will be a proper subset of the same query evaluated by XQuery. Of course, the 
lack of a complete formal semantics for XQuery does not allow us to prove that 
relation. 

The most visible feature we dropped from XQuery is types (and consequently
type coercion). Types are indeed important for certain query optimizations
because they make it possible to catch certain mistakes at compile time. Moreover,
type coercion is quite convenient when dealing with semi-structured data, as it
allows for shorter expressions. Unfortunately, types, and especially type coercions,
add a lot of complexity to a formal semantic definition of a language. And since
types are optional in XQuery anyway, we decided to omit them for our
sublanguage.

Secondly, we removed most of the axes of navigation, namely the horizontal
ones (‘following’, ‘following-sibling’, ‘preceding’, ‘preceding-sibling’) and half of
the vertical ones (‘ancestor’), preserving only the ‘descendant-or-self’ and ‘child’
directions. Indeed, it has been shown formally that all other axes of navigation
can be reduced to the ones we preserved; thus from a theory perspective such
a simplification makes sense. From an educational perspective, it is sufficient
to observe that the extra navigation axes are rarely needed and hence only add to
the cognitive overhead.

Finally, we omitted primitive functions and primitive data types, the order
by clause, namespaces, comments, processing instructions and entities. For
these features we argue that they are necessary to specify a full-fledged query
language, yet add too much overhead to incorporate them in a concise, yet formal
semantics.



declare function oneLevel($l,$p) {
  element { "part" } {
    attribute { "partId" } { $p/@partId },
    for $s in $l//part where $s/@partOf=$p/@partId return oneLevel($l,$s)
  }
};

let $list := doc("partList.xml")/partList return
  element { "intList" } {
    for $p in $list//part[empty(@partOf)] return oneLevel($list,$p)
  }



Fig. 1. A LiXQuery query 



3 An Example 

The query in Fig. 1 restructures a list of parts, containing information about
their containing parts, to an embedded list of the parts with their subparts [5].
For instance, the document of Fig. 2 will be transformed into that of Fig. 3. The
query starts with the definition of the function oneLevel. This is followed by the
let-clause that defines the variable $list, whose value is the partList element
in the file partList.xml. Then a new element with name intList is returned. Its
content is the result of calling the function oneLevel for each part-element $p
in the $list element that has no partOf-attribute.
The function oneLevel constructs a new part-element with one attribute. It is
named partId and its value is the string value of the partId attribute of the
element $p (the second parameter of oneLevel). Furthermore, the new part-element
has a child element for each part $s in the first parameter $l that is a part
of $p; each such child is constructed by calling oneLevel recursively for $s.
If the file partList.xml contains the document of Fig. 2, the result is the
document of Fig. 3.

<?xml version="1.0"?>
<partList>
  <part partId="1"/> <part partId="2" partOf="1"/>
  <part partId="3" partOf="1"/> <part partId="4" partOf="3"/>
  <part partId="5"/> <part partId="6" partOf="5"/>
</partList>



Fig. 2. Content of the file partList.xml
<intList>
  <part partId="1">
    <part partId="2"/>
    <part partId="3"> <part partId="4"/> </part>
  </part>
  <part partId="5"> <part partId="6"/> </part>
</intList>



Fig. 3. Result of the query in Fig. 1 
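
To make the recursion in Fig. 1 easier to follow, the same restructuring can be sketched in Python. This is only an illustration under assumptions: it uses the standard `xml.etree.ElementTree` module rather than an XQuery engine, and the names `one_level`, `restructure`, and `shape` are introduced here and do not appear in the paper.

```python
import xml.etree.ElementTree as ET

# The document of Fig. 2, inlined so that the sketch is self-contained.
PART_LIST = """<partList>
  <part partId="1"/> <part partId="2" partOf="1"/>
  <part partId="3" partOf="1"/> <part partId="4" partOf="3"/>
  <part partId="5"/> <part partId="6" partOf="5"/>
</partList>"""

def one_level(part_list, p):
    """Counterpart of oneLevel($l, $p): copy p's id, then nest its subparts."""
    new_part = ET.Element("part", {"partId": p.get("partId")})
    for s in part_list.iter("part"):              # $l//part, in document order
        if s.get("partOf") == p.get("partId"):    # where $s/@partOf = $p/@partId
            new_part.append(one_level(part_list, s))
    return new_part

def restructure(xml_text):
    """Counterpart of the main expression: start from parts without partOf."""
    part_list = ET.fromstring(xml_text)
    int_list = ET.Element("intList")
    for p in part_list.iter("part"):
        if p.get("partOf") is None:               # empty(@partOf)
            int_list.append(one_level(part_list, p))
    return int_list

def shape(element):
    """Summarize a part subtree as (partId, [child shapes]) for inspection."""
    return (element.get("partId"), [shape(child) for child in element])

result = [shape(part) for part in restructure(PART_LIST)]
```

The nesting of `result` corresponds to the element nesting of Fig. 3: part 1 contains parts 2 and 3, part 3 contains part 4, and part 5 contains part 6.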



4 Syntax and Informal Description of LiXQuery 

We first give the syntax and the informal semantics of LiXQuery, and then extend 
it with some syntactic sugar. 



4.1 Basic Syntax 

The syntax of LiXQuery is given in Fig. 4 as an abstract syntax, i.e., it assumes 
that extra brackets and precedence rules are added for disambiguation. 

All queries in LiXQuery are syntactically correct in XQuery and their LiXQuery
semantics is consistent with their XQuery semantics. Built-in functions for
manipulation of basic values are omitted. The non-terminal ⟨Name⟩ refers to the
set of names $\mathcal{N}$, which we will not describe in detail here except that the
names are strings that must start with a letter or “_”. The non-terminal ⟨String⟩
refers to strings that are enclosed in double quotes such as in "abc" and ⟨Integer⟩
refers to integers without quotes such as 100, +100, and -100.¹ Therefore the
sets associated with ⟨Name⟩, ⟨String⟩ and ⟨Integer⟩ are pairwise disjoint.

The syntax contains 24 rules. Their informal semantics is mostly straightfor- 
ward. Some of the rules were illustrated in the introductory example. 

The ambiguity between rule [5] and [24] is resolved by giving precedence to 
[5], and for path expressions we will assume that the operators “/” and “//” 
(rule [18]) are left associative and are preceded by the filter operation (rule [17]) 
in priority. 



4.2 Informal Semantics 

Since we assume that the reader is already somewhat familiar with XQuery we 
only describe here the semantics of some of the less common expressions. 

In rule [5] the built-in functions are declared. The function doc() returns
the document node that is the root of the tree that corresponds to the content
of the file with the name that was given as its argument, e.g., doc("file.xq")
indicates the document root of the content of the file file.xq. The function
name() gives the tag name of an element node or the attribute name of an
attribute node. The function string() gives the string value of an attribute

¹ Integers are the only numeric type that exists in LiXQuery.
[1]  ⟨Query⟩    → (⟨FunDef⟩ “;”)* ⟨Expr⟩
[2]  ⟨FunDef⟩   → “declare” “function” ⟨Name⟩ “(” (⟨Var⟩ (“,” ⟨Var⟩)*)? “)”
                  “{” ⟨Expr⟩ “}”
[3]  ⟨Expr⟩     → ⟨Var⟩ | ⟨BuiltIn⟩ | ⟨IfExpr⟩ | ⟨ForExpr⟩ | ⟨LetExpr⟩ | ⟨Concat⟩ |
                  ⟨AndOr⟩ | ⟨ValCmp⟩ | ⟨NodeCmp⟩ | ⟨AddExpr⟩ | ⟨MultExpr⟩ |
                  ⟨Union⟩ | ⟨Step⟩ | ⟨Filter⟩ | ⟨Path⟩ | ⟨Literal⟩ | ⟨EmpSeq⟩ |
                  ⟨Constr⟩ | ⟨TypeSw⟩ | ⟨FunCall⟩
[4]  ⟨Var⟩      → “$” ⟨Name⟩
[5]  ⟨BuiltIn⟩  → “doc(” ⟨Expr⟩ “)” | “name(” ⟨Expr⟩ “)” | “string(” ⟨Expr⟩ “)” |
                  “xs:integer(” ⟨Expr⟩ “)” | “root(” ⟨Expr⟩ “)” |
                  “concat(” ⟨Expr⟩ “,” ⟨Expr⟩ “)” | “true()” | “false()” |
                  “not(” ⟨Expr⟩ “)” | “count(” ⟨Expr⟩ “)” | “position()” | “last()”
[6]  ⟨IfExpr⟩   → “if” “(” ⟨Expr⟩ “)” “then” ⟨Expr⟩ “else” ⟨Expr⟩
[7]  ⟨ForExpr⟩  → “for” ⟨Var⟩ (“at” ⟨Var⟩)? “in” ⟨Expr⟩ “return” ⟨Expr⟩
[8]  ⟨LetExpr⟩  → “let” ⟨Var⟩ “:=” ⟨Expr⟩ “return” ⟨Expr⟩
[9]  ⟨Concat⟩   → ⟨Expr⟩ “,” ⟨Expr⟩
[10] ⟨AndOr⟩    → ⟨Expr⟩ (“and” | “or”) ⟨Expr⟩
[11] ⟨ValCmp⟩   → ⟨Expr⟩ (“=” | “<”) ⟨Expr⟩
[12] ⟨NodeCmp⟩  → ⟨Expr⟩ (“is” | “<<”) ⟨Expr⟩
[13] ⟨AddExpr⟩  → ⟨Expr⟩ (“+” | “-”) ⟨Expr⟩
[14] ⟨MultExpr⟩ → ⟨Expr⟩ (“*” | “idiv”) ⟨Expr⟩
[15] ⟨Union⟩    → ⟨Expr⟩ “|” ⟨Expr⟩
[16] ⟨Step⟩     → “.” | “..” | ⟨Name⟩ | “@” ⟨Name⟩ | “*” | “@*” | “text()”
[17] ⟨Filter⟩   → ⟨Expr⟩ “[” ⟨Expr⟩ “]”
[18] ⟨Path⟩     → ⟨Expr⟩ (“/” | “//”) ⟨Expr⟩
[19] ⟨Literal⟩  → ⟨String⟩ | ⟨Integer⟩
[20] ⟨EmpSeq⟩   → “()”
[21] ⟨Constr⟩   → “element” “{” ⟨Expr⟩ “}” “{” ⟨Expr⟩ “}” |
                  “attribute” “{” ⟨Expr⟩ “}” “{” ⟨Expr⟩ “}” |
                  “text” “{” ⟨Expr⟩ “}” | “document” “{” ⟨Expr⟩ “}”
[22] ⟨TypeSw⟩   → “typeswitch” “(” ⟨Expr⟩ “)” (“case” ⟨Type⟩ “return” ⟨Expr⟩)+
                  “default” “return” ⟨Expr⟩
[23] ⟨Type⟩     → “xs:boolean” | “xs:integer” | “xs:string” | “element()” |
                  “attribute()” | “text()” | “document-node()”
[24] ⟨FunCall⟩  → ⟨Name⟩ “(” (⟨Expr⟩ (“,” ⟨Expr⟩)*)? “)”



Fig. 4. Syntax for LiXQuery queries and expressions 



node or text node, and converts integers to strings. The function xs:integer()²
converts strings to integers. The function root() gives for a node the root of
its tree. The function concat() concatenates strings. Rule [11] introduces the
comparison operators for basic values. Note that “2 < 10” and “"10" < "2"” 
both hold. These comparison operators have existential semantics, i.e., they are 
true for two sequences if there is a basic value in one sequence and a basic value 
in the other sequence such that the comparison holds between these two basic 
values. Rule [12] gives the comparison operators for nodes where “is” detects 

² “xs:” indicates a namespace. Although we do not handle namespaces we use them
here to be compatible with XQuery. 
the equality of nodes and “<<” compares nodes in document order. Rule [15]
expresses the union of two node sequences, i.e., it returns a sequence of nodes
that contains exactly all the nodes in the operands, contains no duplicates and
is sorted in document order. Rule [21] gives the constructors for each type of
node. The semantics of “element {e₁}{e₂}” is that an element node with name
e₁ and content e₂ is created. The semantics of “attribute {e₁}{e₂}” is that an
attribute node with name e₁ and value e₂ is created. The semantics of “text
{e}” is that a text node with value e is created. The semantics of “document
{e}” is that a document node with attributes and content as in e is created.
Rules [22] and [23] define the typeswitch-expression that checks whether a value
belongs to certain types and for the first type that matches returns a certain
value.
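
The existential semantics of the comparison operators of rule [11] can be surprising at first, so a small Python sketch may help. It is an illustration only: the helper names `as_sequence`, `value_lt`, and `value_eq` are introduced here, sequences are modelled as Python lists, and Python's built-in ordering on integers and strings is assumed to stand in for the orderings that LiXQuery presumes.

```python
def as_sequence(value):
    """Treat a single basic value as a sequence of length one."""
    return list(value) if isinstance(value, (list, tuple)) else [value]

def value_lt(left, right):
    # Existential "<": true if SOME pair (a, b), with a taken from the left
    # operand and b from the right operand, satisfies a < b.
    # Assumption: both operands hold values of one type (all int or all str).
    return any(a < b for a in as_sequence(left) for b in as_sequence(right))

def value_eq(left, right):
    # Existential "=": true if some pair of values is equal.
    return any(a == b for a in as_sequence(left) for b in as_sequence(right))

checks = {
    "2 < 10": value_lt(2, 10),                    # integer order
    '"10" < "2"': value_lt("10", "2"),            # string order: "1" sorts before "2"
    "(1, 5) < (0, 3)": value_lt([1, 5], [0, 3]),  # holds because 1 < 3
    "(0, 3) < (1, 5)": value_lt([0, 3], [1, 5]),  # also holds because 0 < 1
    "(5, 6) < (1, 2)": value_lt([5, 6], [1, 2]),  # no pair qualifies
    "() = ()": value_eq([], []),                  # no pair exists at all
}
```

Two consequences are visible in `checks`: on sequences, “<” can hold in both directions at once, and the empty sequence is not equal to itself because there is no pair of values to compare.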

4.3 Syntactic Sugar 

To allow for a shorter notation of certain very common expressions we introduce 
the following short-hands. 

The Empty Function. The function empty () is assumed to be declared as follows: 
declare function empty($sequence) { count($sequence) = 0 };

Quantified Formulas. The expression “some $v in e₁ satisfies e₂” is introduced
as a shorthand for “not(empty(for $v in e₁ return if (e₂) then $v
else ()))”, and “every $v in e₁ satisfies e₂” is introduced as a shorthand
for “empty(for $v in e₁ return if (e₂) then () else $v)”.
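
The two quantifier shorthands can be checked with a direct transcription into Python. This is a sketch under assumptions: sequences are modelled as lists, the condition e₂ is modelled as a Python predicate, and the names `for_expr`, `empty`, `some`, and `every` are chosen here to mirror the LiXQuery expressions.

```python
def for_expr(seq, body):
    """for $v in seq return body($v): concatenate the per-item results."""
    out = []
    for v in seq:
        out.extend(body(v))
    return out

def empty(seq):
    """count($sequence) = 0"""
    return len(seq) == 0

def some(seq, pred):
    # not(empty(for $v in e1 return if (e2) then $v else ()))
    return not empty(for_expr(seq, lambda v: [v] if pred(v) else []))

def every(seq, pred):
    # empty(for $v in e1 return if (e2) then () else $v)
    return empty(for_expr(seq, lambda v: [] if pred(v) else [v]))

numbers = [1, 2, 3]
some_big = some(numbers, lambda v: v > 2)       # 3 is a witness
every_big = every(numbers, lambda v: v > 2)     # 1 and 2 are counterexamples
every_on_empty = every([], lambda v: False)     # vacuously true
some_on_empty = some([], lambda v: True)        # no witness exists
```

Under these assumptions the desugared forms agree with Python's `any` and `all`, including on the empty sequence.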

FLWOR Expression. When for- and let-expressions are nested we allow that
the intermediate “return” is removed. E.g., “for $v₁ in e₁ return let $v₂
:= e₂ return e₃” may be written as “for $v₁ in e₁ let $v₂ := e₂ return
e₃”. Furthermore we allow in for- and let-expressions the shorthand “where e₁
return e₂” for “return if (e₁) then e₂ else ()”.

Coercion. Let e₁ (or e₂) have the form “string(e)” where the result of e is
a sequence containing a single text node or a single attribute node. Then e₁
(or e₂) can be replaced by e in the following expressions: “xs:integer(e₁)”,
“concat(e₁,e₂)”, “e₁=e₂”, “e₁<e₂” and “attribute{e₃}{e₂}”.

5 More Examples 

In this section we demonstrate the expressive power of LiXQuery. 

5.1 Simulating Deep Equality 

The first example shows that we can express deep equality of two sequences. This 
essentially means that we have to check whether two fragments are isomorphic 
except that we have to take into account that attributes are unordered. For a 
more formal definition see Definition 9. 
declare function deepatl($e,$f) {
  (: detects whether the attributes of $e are equal in name and value with those of $f :)
  every $ae in $e/@* satisfies
    some $af in $f/@* satisfies
      ( name($ae)=name($af) and string($ae)=string($af) )
  and
  every $af in $f/@* satisfies
    some $ae in $e/@* satisfies
      ( name($ae)=name($af) and string($ae)=string($af) )
};

declare function typetext($e) {
  (: verifies whether $e is a text node :)
  typeswitch ($e) case text() return true() default return false()
};

declare function deepequal($se,$sf) {
  (: detects whether $se and $sf are sequences of pairwise deep equal items :)
  if (empty($se) and empty($sf)) then true()
  else
    if (empty($se) or empty($sf)) then false()
    else
      if (typetext($se[1]))
      then if (typetext($sf[1]))
           then ( string($se[1])=string($sf[1]) and
                  deepequal($se[1 < position()], $sf[1 < position()]) )
           else false()
      else if (typetext($sf[1]))
           then false()
           else ( name($se[1])=name($sf[1]) and
                  deepatl($se[1],$sf[1]) and
                  deepequal($se[1]/(* | text()), $sf[1]/(* | text())) and
                  deepequal($se[1 < position()], $sf[1 < position()])
                )
};
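
The structure of the deep-equality check above can also be traced in a short Python sketch. It is an illustration under assumptions, not a LiXQuery implementation: text nodes are modelled as Python strings, element nodes as dictionaries built by the hypothetical helper `elem`, and attribute sets as dictionaries, which is only adequate because all attributes of one element have different names.

```python
def elem(name, attrs=None, children=None):
    """Hypothetical element node: a name, unordered attributes, ordered children."""
    return {"name": name, "attrs": dict(attrs or {}), "children": list(children or [])}

def is_text(item):
    """Counterpart of typetext: text nodes are modelled as plain strings."""
    return isinstance(item, str)

def deep_att(e, f):
    # Attributes are unordered, so compare them as name -> value mappings.
    # With unique attribute names this matches the two every/some checks.
    return e["attrs"] == f["attrs"]

def deep_equal(se, sf):
    """Pairwise deep equality of two sequences of text and element nodes."""
    if not se and not sf:
        return True
    if not se or not sf:
        return False
    e, f = se[0], sf[0]
    if is_text(e) or is_text(f):
        head_equal = is_text(e) and is_text(f) and e == f
    else:
        head_equal = (e["name"] == f["name"]
                      and deep_att(e, f)
                      and deep_equal(e["children"], f["children"]))
    # The remaining items correspond to $se[1 < position()] and $sf[1 < position()].
    return head_equal and deep_equal(se[1:], sf[1:])

a = elem("p", {"x": "1", "y": "2"}, ["hi", elem("q")])
b = elem("p", {"y": "2", "x": "1"}, ["hi", elem("q")])   # attributes reordered
c = elem("p", {"x": "1", "y": "2"}, [elem("q"), "hi"])   # children reordered
```

Here `a` and `b` are deep equal because attribute order is irrelevant, while `a` and `c` are not because the order of children matters.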



5.2 Simulation of Other Axes 



We can simulate all the axes that are not already directly supported in the 
syntax of LiXQuery. To demonstrate this we show here the following-sibling 
and ancestor axis. 



declare function following-sibling($s) {
  (: retrieves all following siblings of the nodes in $s :)
  for $node in $s
  for $sib in $node/../*
  where $node << $sib
  return $sib
};

declare function ancestor($s) {
  (: retrieves all ancestors of the nodes in $s :)
  for $node in $s
  for $anc in root($node)//.
  where some $v in $anc//* satisfies $v is $node
  return $anc
};
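
The idea behind both simulations, namely reaching other axes with only parent, child, and descendant steps, can be sketched in Python as follows. The `Node` class and the function names are assumptions introduced for this illustration; attribute and text nodes are left out to keep the sketch short.

```python
class Node:
    """Hypothetical element node with a parent pointer and ordered children."""
    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def root(node):
    """Counterpart of root(): follow parent pointers to the top of the tree."""
    while node.parent is not None:
        node = node.parent
    return node

def descendants_or_self(node):
    """Nodes of the subtree in document order (the step written as //.)."""
    yield node
    for child in node.children:
        yield from descendants_or_self(child)

def following_siblings(node):
    # $node/../* restricted to the siblings that come after $node.
    if node.parent is None:
        return []
    siblings = node.parent.children
    index = next(i for i, s in enumerate(siblings) if s is node)
    return siblings[index + 1:]

def ancestors(node):
    # root($node)//. restricted to nodes that have $node as a proper descendant.
    return [a for a in descendants_or_self(root(node))
            if any(v is node
                   for child in a.children
                   for v in descendants_or_self(child))]

# A small example tree: r has children a and d, and a has children b and c.
b, c, d = Node("b"), Node("c"), Node("d")
a = Node("a", [b, c])
r = Node("r", [a, d])
```

As in the LiXQuery version, `ancestors` scans the whole tree of the node, which is simple but not efficient; the sketch is meant to show expressibility only.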



5.3 Simulation of the Full string() Function

In LiXQuery the string() function is only defined for integers, attribute nodes
and text nodes, but in XQuery it is defined for all items. We can simulate this
more general function as follows.

declare function concatAll($x) {
  (: concatenate all strings in $x :)
  if ( empty($x) )
  then ""
  else concat($x[position()=1], concatAll($x[1 < position()]))
};

declare function xqString($x) {
  (: simulates full xquery string function :)
  if ( empty($x) )
  then ""
  else typeswitch ($x)
    case document-node() return concatAll($x/text())
    case element() return concatAll($x/text())
    default return string($x)
};

5.4 Turing Completeness 

It is easy to see that the amount of arithmetic and recursion in LiXQuery allows 
us to express all partial recursive functions over numbers. It is also possible to 
simulate LISP. For this purpose we represent a LISP list ((b c) d) as shown in
Fig. 5. Given this representation we simulate the car, cdr and cons functions:

declare function car($x) { $x/*[1] };
declare function cdr($x) { element { "list" } { $x/*[1 < position()] } };
declare function cons($x,$y) { element { "list" } { $x, $y/* } };

Since we can also compare strings and have conditional expressions, it is easy 
to see that by using recursion we can define all partial recursive functions over 
LISP lists. 

<list> 

<list> <atom> b </atom> <atom> c </atom> </list> <atom> d </atom> 
</list> 



Fig. 5. Simulation of the LISP list ((b c) d) 
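
The list encoding of Fig. 5 and the three functions above can be tried out with a Python sketch based on the standard `xml.etree.ElementTree` module. The helper names `atom`, `lisp_list`, and `show` are assumptions made for this illustration. Elements are copied when a new list is built, which mirrors the way element constructors create new nodes rather than sharing existing ones.

```python
import copy
import xml.etree.ElementTree as ET

def atom(symbol):
    """Build <atom>symbol</atom>."""
    element = ET.Element("atom")
    element.text = symbol
    return element

def lisp_list(*items):
    """Build <list>...</list> containing copies of the given items."""
    element = ET.Element("list")
    element.extend([copy.deepcopy(item) for item in items])
    return element

def car(x):
    return x[0]                          # $x/*[1]

def cdr(x):
    return lisp_list(*list(x)[1:])       # element {"list"} { $x/*[1 < position()] }

def cons(x, y):
    return lisp_list(x, *list(y))        # element {"list"} { $x, $y/* }

def show(x):
    """Render the encoding back into LISP notation for inspection."""
    if x.tag == "atom":
        return x.text
    return "(" + " ".join(show(child) for child in x) + ")"

# The list ((b c) d) of Fig. 5.
example = lisp_list(lisp_list(atom("b"), atom("c")), atom("d"))
```

For instance, `show(cons(atom("a"), example))` renders the list with `a` prepended, and `show(cdr(example))` renders the one-element list containing `d`.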



6 Formal Semantics 

We now proceed with the formal semantics of LiXQuery. 

Definition 1 (Atomic Value). We assume a set of booleans $\mathcal{B} =
\{\mathit{true}, \mathit{false}\}$, a set of strings $\mathcal{S}$ and a set of integers
$\mathcal{I}$ that contains the integers.³
Furthermore a set of names $\mathcal{N} \subset \mathcal{S}$ is identified that contains those strings
that may be used as tag names [1]. For each of these sets a strict total ordering,
written as $<$, is presumed to exist. The set of all atomic values is
$\mathcal{A} = \mathcal{B} \cup \mathcal{S} \cup \mathcal{I}$.

We also assume the functions $\mathit{AtValueToString} : \mathcal{A} \to \mathcal{S}$, which is a function
that maps the atomic values to their string representation, and
$\mathit{StringToInteger} : \mathcal{S} \to \mathcal{I}$, which is a partial function that maps strings
that represent integers to their integer value.

Definition 2 (Node). We assume four countably infinite sets of nodes $\mathcal{V}^d$,
$\mathcal{V}^e$, $\mathcal{V}^a$ and $\mathcal{V}^t$ which respectively represent the set of document,
element, attribute and text nodes. These sets are pairwise disjoint with each other
and with the set of atomic values.

³ We denote the empty string as "", non-empty strings as for example "123" and the
concatenation of two strings $s_1$ and $s_2$ as $s_1 \cdot s_2$.
The set of all nodes is denoted as $\mathcal{V}$, i.e., $\mathcal{V} = \mathcal{V}^d \cup \mathcal{V}^e \cup \mathcal{V}^a \cup \mathcal{V}^t$.

Expressions will be evaluated against an XML store which contains XML 
fragments. This store contains the fragments that are created as intermediate 
results, but also the web documents that are accessed by the expression. Al- 
though in practice these documents are materialized in the store when they are 
accessed for the first time, we will assume here that all documents are in fact 
already in the store when the expression is evaluated. 

Definition 3 (XML Store). An XML store is a 6-tuple $St = (V, E, <, \nu, \sigma, \delta)$
with

- $V$ is a finite subset of $\mathcal{V}$; we write $V^d$ for $V \cap \mathcal{V}^d$ (resp. $V^e$ for $V \cap \mathcal{V}^e$,
$V^a$ for $V \cap \mathcal{V}^a$ and $V^t$ for $V \cap \mathcal{V}^t$);

- $(V, E)$ is an acyclic directed graph (with nodes $V$ and directed edges $E$) where
each node has an in-degree of at most one, and hence it is composed of trees;
if $(m, n) \in E$ then we say that $n$ is a child of $m$; we denote by $E^*$ the
reflexive transitive closure of $E$;

- $<$ is a strict partial order on $V$ that compares exactly the different children
of a common node, hence $((n_1 < n_2) \lor (n_1 = n_2) \lor (n_2 < n_1)) \Leftrightarrow \exists m \in V((m, n_1) \in E \land (m, n_2) \in E)$;

- $\nu : V^e \cup V^a \to \mathcal{N}$ labels the element and attribute nodes with their node
name;

- $\sigma : V^a \cup V^t \to \mathcal{S}$ labels the attribute and text nodes with their string value;

- $\delta : \mathcal{S} \to V^d$ is a partial function that associates a document node with a URL
or a file name. It is called the document function. This function represents
all the URLs of the Web and all the names of the files, together with the
documents they contain. We suppose that all these documents are in the
store.

The following properties have to hold for an XML store: 

- each document node of $V^d$ is the root of a tree;

- attribute nodes of $V^a$ and text nodes of $V^t$ do not have any children;

- in the $<$-order attribute children precede the element and text children, i.e.
if $n_1 < n_2$ and $n_2 \in V^a$ then $n_1 \in V^a$;

- there are no adjacent text children, i.e. if $n_1, n_2 \in V^t$ and $n_1 < n_2$ then there
is an $n_3 \in V^e$ with $n_1 < n_3 < n_2$;

- for all text nodes $n_t$ of $V^t$ holds $\sigma(n_t) \neq$ "";

- all the attribute children of a common node have a different name, i.e. if
$(m, n_1), (m, n_2) \in E$ and $n_1, n_2 \in V^a$ then $\nu(n_1) \neq \nu(n_2)$.

Definition 4 (Union of stores). Two stores $St = (V, E, <, \nu, \sigma, \delta)$ and $St' =
(V', E', <', \nu', \sigma', \delta')$ are disjoint, denoted as $St \cap St' = \emptyset$, iff $V \cap V' = \emptyset$. The
definition of the union of two disjoint stores $St$ and $St'$, denoted as $St \cup St'$, is
straightforward.

Footnote: As opposed to the terminology of XQuery, we consider attribute nodes as children
of their associated element node. The definitions of parent, descendant and ancestor
are straightforward.

Definition 5 (Root of a node). Given a store $St$, the root of one of its nodes
$n$, denoted as $root(n)$, is the unique root of the tree of $n$, i.e. $root(n) = r$ iff
$(r, n) \in E^*$ and for no node $s \neq r$ of $St$ holds $(s, r) \in E^*$.



Definition 6 (Item). An item of an XML store St is an atomic value in A or 
a node in St. 



We denote the empty sequence as $\langle\rangle$, non-empty sequences as for example $\langle 1, 2, 3 \rangle$
and the concatenation of two sequences $l_1$ and $l_2$ as $l_1 \circ l_2$. The expression
$\langle y_i \in l \mid \varphi(y, i) \rangle$ with $\varphi(y, i)$ a formula of an item $y$ and position $i$ denotes the
subsequence of $l$ that is obtained by selecting from $l$ all items that satisfy $\varphi(y, i)$.



Example 1. The XML store that is represented in 
Fig. 5 is S't = {V,E,<,v,a,5) and is shown in 
Fig. 6 . The set of nodes F® = {nf , n|, n§, nf}, 

V* = {nl,nlnl}, = $, E = 

{(nf,n|),(nf,nf),(n|,n|), (n|,n|), (n|,n^), 

(n§, rig), (nf, rig)}, the order relation < is defined 
by n| < n^,n^ < furthermore n{nl) = vfn^) = 
“list”, vfnff) = v{n%) = v{nf) = “atom” and 
a(n|) = “b”, a{n%) = “c”, a(n*) = “d”.^ 




Fig. 6. XML tree of Fig. 5

Definition 7 (Document Order of a Store). A document order of a store $St$ is a total
order $\ll_{St}$ on $V$ such that

1. if $(n_1, n_2) \in E^*$ and $n_1 \neq n_2$ then $n_1 \ll_{St} n_2$;

2. if $(n_1, n_2) \in E^*$ and $n_1 < n_3$ then $n_2 \ll_{St} n_3$;

3. if $(n_1, n_2), (n_1, n_4) \in E^*$ and $n_2 \ll_{St} n_3 \ll_{St} n_4$ then $(n_1, n_3) \in E^*$.

Conditions 1 and 2 define the preorder in a tree. Condition 3 says that the nodes of a tree are clustered.



The set of items in a sequence $l$ is denoted as $Set(l)$. Given a sequence
of nodes $l$ in an XML store $St$ we let $Ord_{St}(l)$ denote the unique sequence
$l' = \langle y_1, \ldots, y_m \rangle$ such that $Set(l) = Set(l')$ and $y_1 \ll_{St} \cdots \ll_{St} y_m$.
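To make the roles of the document order and of Ord concrete, the following is a minimal Python sketch, not part of the formal definition. It assumes a store is given simply as a mapping from each node to its ordered list of children (the names `document_order`, `ord_st`, `CHILDREN` and `RANK` are hypothetical), and it picks the document order that visits the trees in the order in which their roots are listed.

```python
from typing import Dict, Hashable, Iterable, List


def document_order(children: Dict[Hashable, List[Hashable]],
                   roots: List[Hashable]) -> Dict[Hashable, int]:
    """Return the preorder rank of every node; a smaller rank means earlier in document order."""
    rank: Dict[Hashable, int] = {}
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        rank[node] = len(rank)
        # Push children in reverse so that the first child is visited next.
        stack.extend(reversed(children.get(node, [])))
    return rank


def ord_st(rank: Dict[Hashable, int], nodes: Iterable[Hashable]) -> List[Hashable]:
    """Sketch of Ord: remove duplicates and sort the nodes by document order."""
    return sorted(set(nodes), key=lambda n: rank[n])


# A small hypothetical store: one document with a list of two atoms.
CHILDREN = {
    "doc": ["list"],
    "list": ["atom1", "atom2"],
    "atom1": ["text1"],
    "atom2": ["text2"],
}
RANK = document_order(CHILDREN, ["doc"])
```

With this store, `ord_st(RANK, ["text2", "atom1", "text2", "list"])` drops the duplicate and returns the nodes with ancestors before descendants and earlier siblings' subtrees before later siblings.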



6.1 Evaluation of Expressions 

Expressions are evaluated against an environment. Assuming that X is the set 
of LiXQuery-expressions this environment is defined as follows. 

Definition 8 (Environment). An environment of an XML store $St$ is a tuple
$En = (\mathbf{a}, \mathbf{b}, \mathbf{v}, \mathbf{x}, \mathbf{k}, \mathbf{m})$ with

Footnote: We do not mention here the documents on the Web and on files.

Footnote: A store can have more than one document order, but we choose a fixed document
order here that we denote by $\ll_{St}$.
1. a partial function $\mathbf{a} : \mathcal{N} \to \mathcal{N}^*$ that maps a function name to its formal
arguments; it is used in rules [1, 2, 24];

2. a partial function $\mathbf{b} : \mathcal{N} \to X$ that maps a function name to the body of the
function; it is also used in rules [1, 2, 24];

3. a partial function $\mathbf{v} : \mathcal{N} \to (\mathcal{V} \cup \mathcal{A})^*$ that maps variable names to their
values;

4. $\mathbf{x}$ which is undefined or an item of $St$ and indicates the context item; it is
used in rules [16, 17, 18];

5. $\mathbf{k}$ which is undefined or an integer and gives the position of the context item
in the context sequence; it is used in rules [5, 17, 18];

6. $\mathbf{m}$ which is undefined or an integer and gives the size of the context sequence;
it is used in rules [5, 17, 18].

If $En$ is an environment, $n$ a name and $y$ an item then we let $En[\mathbf{a}(n) \mapsto y]$
($En[\mathbf{b}(n) \mapsto y]$, $En[\mathbf{v}(n) \mapsto y]$) denote the environment that is equal to $En$
except that the function $\mathbf{a}$ ($\mathbf{b}$, $\mathbf{v}$) maps $n$ to $y$. Similarly, we let $En[\mathbf{x} \mapsto y]$
($En[\mathbf{k} \mapsto y]$, $En[\mathbf{m} \mapsto y]$) denote the environment that is equal to $En$ except
that $\mathbf{x}$ ($\mathbf{k}$, $\mathbf{m}$) is defined as $y$ if $y \neq \bot$ and undefined otherwise.

We write $St, En \vdash e \Rightarrow (St', v)$ to denote that the evaluation of expression
$e$ against the XML store $St$ and environment $En$ of $St$ may result in the new
XML store $St'$ and value $v$ of $St'$.

6.2 Semantic Rules 

In what follows we give the reasoning rules that are used to define the semantics
of LiXQuery. Each rule consists of a set of premises and a conclusion of the form
$St, En \vdash e \Rightarrow (St', v)$. The free variables in the rules are always assumed to
be universally quantified. We will use the following notation: $v$ for values, $x$ for
items, $n$ for nodes, $r$ for roots, $s$ for strings and names, $f$ for function names, $b$
for booleans, $i$ for integers and $e$ for expressions.

Query (Rules [1] and [2]) A function declaration extends a and b and then 
the last expression is evaluated with these a and b. Function declarations are 
allowed to be mutually recursive. 

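$$\frac{En' = En[\mathbf{a}(f) \mapsto \langle s_1, \ldots, s_m \rangle][\mathbf{b}(f) \mapsto e] \qquad St, En' \vdash e' \Rightarrow (St', v)}{St, En \vdash \mathtt{declare\ function}\ f(s_1, \ldots, s_m)\{\, e \,\};\ e' \Rightarrow (St', v)}$$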

Variable (Rule [4]) 

$$St, En \vdash \mathtt{\$}s \Rightarrow (St, \mathbf{v}_{En}(s))$$

Built-in Functions (Rule [5]) 

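$$\frac{St, En \vdash e \Rightarrow (St', \langle s \rangle) \qquad \delta_{St'}(s) = n}{St, En \vdash \mathtt{doc}(e) \Rightarrow (St', \langle n \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle n \rangle) \qquad n \in \mathcal{V}^e \cup \mathcal{V}^a}{St, En \vdash \mathtt{name}(e) \Rightarrow (St', \langle \nu_{St'}(n) \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle n \rangle) \qquad n \in \mathcal{V}^a \cup \mathcal{V}^t}{St, En \vdash \mathtt{string}(e) \Rightarrow (St', \langle \sigma_{St'}(n) \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle x \rangle) \qquad x \in \mathcal{A} \qquad AtValueToString(x) = s}{St, En \vdash \mathtt{string}(e) \Rightarrow (St', \langle s \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle s \rangle) \qquad s \in \mathcal{S} \qquad StringToInteger(s) = i}{St, En \vdash \mathtt{xs{:}integer}(e) \Rightarrow (St', \langle i \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle n \rangle) \qquad n \in V_{St'}}{St, En \vdash \mathtt{root}(e) \Rightarrow (St', \langle root(n) \rangle)}$$

$$\frac{St, En \vdash e_1 \Rightarrow (St_1, \langle s_1 \rangle) \qquad s_1 \in \mathcal{S} \qquad St_1, En \vdash e_2 \Rightarrow (St_2, \langle s_2 \rangle) \qquad s_2 \in \mathcal{S}}{St, En \vdash \mathtt{concat}(e_1, e_2) \Rightarrow (St_2, \langle s_1 \cdot s_2 \rangle)}$$

$$St, En \vdash \mathtt{true}() \Rightarrow (St, \langle \mathrm{true} \rangle) \qquad\qquad St, En \vdash \mathtt{false}() \Rightarrow (St, \langle \mathrm{false} \rangle)$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle b \rangle) \qquad b \in \mathcal{B}}{St, En \vdash \mathtt{not}(e) \Rightarrow (St', \langle \neg b \rangle)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle x_1, \ldots, x_m \rangle)}{St, En \vdash \mathtt{count}(e) \Rightarrow (St', \langle m \rangle)}$$

$$St, En \vdash \mathtt{position}() \Rightarrow (St, \langle \mathbf{k}_{En} \rangle) \qquad\qquad St, En \vdash \mathtt{last}() \Rightarrow (St, \langle \mathbf{m}_{En} \rangle)$$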

If-expression (Rule [6]). The semantics of the if-expression is given by two 
inference rules: one for the case the condition evaluates to true and one for 
false. Note that in each case only one of the branches is executed. 

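$$\frac{St, En \vdash e \Rightarrow (St', \langle \mathrm{true} \rangle) \qquad St', En \vdash e_1 \Rightarrow (St_1, v_1)}{St, En \vdash \mathtt{if}\ e\ \mathtt{then}\ e_1\ \mathtt{else}\ e_2 \Rightarrow (St_1, v_1)}$$

$$\frac{St, En \vdash e \Rightarrow (St', \langle \mathrm{false} \rangle) \qquad St', En \vdash e_2 \Rightarrow (St_2, v_2)}{St, En \vdash \mathtt{if}\ e\ \mathtt{then}\ e_1\ \mathtt{else}\ e_2 \Rightarrow (St_2, v_2)}$$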

For-expression (Rule [7]) The rule for for $s at $s' in e return e' specifies
that first e is evaluated and then e' is evaluated for each item in the result of e,
with s and s' bound in the environment to, respectively, the item in question and its
position in the result of e. Finally the results for each item are concatenated to
a single sequence.

$$\frac{\begin{array}{c} St, En \vdash e \Rightarrow (St_0, \langle x_1, \ldots, x_m \rangle) \qquad St_0, En[\mathbf{v}(s) \mapsto x_1][\mathbf{v}(s') \mapsto 1] \vdash e' \Rightarrow (St_1, v_1) \\ \ldots \qquad St_{m-1}, En[\mathbf{v}(s) \mapsto x_m][\mathbf{v}(s') \mapsto m] \vdash e' \Rightarrow (St_m, v_m) \end{array}}{St, En \vdash \mathtt{for}\ \mathtt{\$}s\ \mathtt{at}\ \mathtt{\$}s'\ \mathtt{in}\ e\ \mathtt{return}\ e' \Rightarrow (St_m, v_1 \circ \ldots \circ v_m)}$$

Let-expression (Rule [8]) 

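$$\frac{St, En \vdash e \Rightarrow (St', v) \qquad St', En[\mathbf{v}(s) \mapsto v] \vdash e' \Rightarrow (St'', v')}{St, En \vdash \mathtt{let}\ \mathtt{\$}s := e\ \mathtt{return}\ e' \Rightarrow (St'', v')}$$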

Concatenation (Rule [9]) 

$$\frac{St, En \vdash e' \Rightarrow (St', v') \qquad St', En \vdash e'' \Rightarrow (St'', v'')}{St, En \vdash e'\, \mathtt{,}\, e'' \Rightarrow (St'', v' \circ v'')}$$
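Boolean Operators (Rule [10])

$$\frac{St, En \vdash e' \Rightarrow (St', \langle b' \rangle) \qquad St', En \vdash e'' \Rightarrow (St'', \langle b'' \rangle) \qquad b', b'' \in \mathcal{B}}{St, En \vdash e'\ \mathtt{and}\ e'' \Rightarrow (St'', \langle b' \land b'' \rangle) \qquad St, En \vdash e'\ \mathtt{or}\ e'' \Rightarrow (St'', \langle b' \lor b'' \rangle)}$$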

Atomic Value Comparisons (Rule [11]) 

St, En h e' => {St' , {x'l, . . . , x'^/)) 

x'l, . . . , x'^i e A St', En h e" {St", \x'!, . . . , x'^n)) x'{, . . . , x'^n £ A 

{^i {^i ^ ) 

St,En're = e" =» {St",{b^)) St,En^r e < e" ^ {St",{b<)) 

Node Comparisons (Rule [12]) 

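$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St', \langle n' \rangle) \qquad St', En \vdash e'' \Rightarrow (St'', \langle n'' \rangle) \\ n', n'' \in \mathcal{V} \qquad b_{is} \Leftrightarrow (n' = n'') \qquad b_{\ll} \Leftrightarrow (n' \ll_{St''} n'') \end{array}}{St, En \vdash e'\ \mathtt{is}\ e'' \Rightarrow (St'', \langle b_{is} \rangle) \qquad St, En \vdash e'\ \mathtt{<<}\ e'' \Rightarrow (St'', \langle b_{\ll} \rangle)}$$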

Additions (Rule [13]) 

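$$\frac{St, En \vdash e' \Rightarrow (St', \langle d' \rangle) \qquad St', En \vdash e'' \Rightarrow (St'', \langle d'' \rangle) \qquad d', d'' \in \mathcal{I}}{St, En \vdash e' + e'' \Rightarrow (St'', \langle d' + d'' \rangle) \qquad St, En \vdash e' - e'' \Rightarrow (St'', \langle d' - d'' \rangle)}$$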

Multiplications (Rule [14]) 

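$$\frac{St, En \vdash e' \Rightarrow (St', \langle d' \rangle) \qquad St', En \vdash e'' \Rightarrow (St'', \langle d'' \rangle) \qquad d', d'' \in \mathcal{I}}{St, En \vdash e' * e'' \Rightarrow (St'', \langle d' \times d'' \rangle) \qquad St, En \vdash e'\ \mathtt{idiv}\ e'' \Rightarrow (St'', \langle d' / d'' \rangle)}$$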

Union (Rule [15]) 

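$$\frac{St, En \vdash e' \Rightarrow (St', v') \qquad St', En \vdash e'' \Rightarrow (St'', v'') \qquad v', v'' \in \mathcal{V}^*}{St, En \vdash e'\ \mathtt{|}\ e'' \Rightarrow (St'', Ord_{St''}(Set(v') \cup Set(v'')))}$$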

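Axis Steps (Rule [16]) The semantics of a step consisting of an element name
s is that all element children of the context node (indicated in the environment
by $\mathbf{x}$) with name s are returned in document order. The semantics of the step
consisting of the wild-card * is the same except that all element children of the
context node are returned.

$$\frac{\mathbf{x}_{En} \text{ is defined}}{St, En \vdash \mathtt{.} \Rightarrow (St, \langle \mathbf{x}_{En} \rangle)} \qquad \frac{(n, \mathbf{x}_{En}) \in E_{St}}{St, En \vdash \mathtt{..} \Rightarrow (St, \langle n \rangle)} \qquad \frac{\nexists n\, ((n, \mathbf{x}_{En}) \in E_{St})}{St, En \vdash \mathtt{..} \Rightarrow (St, \langle\rangle)}$$

$$\frac{W = \{n \mid (\mathbf{x}_{En}, n) \in E_{St} \land n \in \mathcal{V}^e \land \nu_{St}(n) = s\}}{St, En \vdash s \Rightarrow (St, Ord_{St}(W))}$$

$$\frac{W = \{n \mid (\mathbf{x}_{En}, n) \in E_{St} \land n \in \mathcal{V}^a \land \nu_{St}(n) = s\}}{St, En \vdash \mathtt{@}s \Rightarrow (St, Ord_{St}(W))}$$

$$\frac{W = \{n \mid (\mathbf{x}_{En}, n) \in E_{St} \land n \in \mathcal{V}^e\}}{St, En \vdash \mathtt{*} \Rightarrow (St, Ord_{St}(W))} \qquad \frac{W = \{n \mid (\mathbf{x}_{En}, n) \in E_{St} \land n \in \mathcal{V}^a\}}{St, En \vdash \mathtt{@*} \Rightarrow (St, Ord_{St}(W))}$$

$$\frac{W = \{n \mid (\mathbf{x}_{En}, n) \in E_{St} \land n \in \mathcal{V}^t\}}{St, En \vdash \mathtt{text}() \Rightarrow (St, Ord_{St}(W))}$$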



Filter-expression (Rule [17]) The semantics of e' [ e'' ] is that first e' is
evaluated, then for each item in the result of e' the expression e'' is evaluated
with $\mathbf{x}$ bound to this item, $\mathbf{k}$ to the position of the item in the result of e' and
$\mathbf{m}$ to the number of items in the result of e'. The result of e'' must be a boolean or
an integer; an integer is converted to true if it is equal to $\mathbf{k}$ and
to false otherwise. Finally, the result is the subsequence of the result of e' that
contains exactly all items for which e'' evaluated to true.

$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St_0, \langle x_1, \ldots, x_m \rangle) \qquad En' = En[\mathbf{m} \mapsto m] \\ St_0, En'[\mathbf{x} \mapsto x_1][\mathbf{k} \mapsto 1] \vdash e'' \Rightarrow (St_1, \langle x'_1 \rangle) \quad \ldots \quad St_{m-1}, En'[\mathbf{x} \mapsto x_m][\mathbf{k} \mapsto m] \vdash e'' \Rightarrow (St_m, \langle x'_m \rangle) \\ x'_1, \ldots, x'_m \in \mathcal{B} \cup \mathcal{I} \qquad v = \langle x_i \mid (x'_i \in \mathcal{I} \land x'_i = i) \lor (x'_i \in \mathcal{B} \land x'_i) \rangle \end{array}}{St, En \vdash e'\, [\, e''\, ] \Rightarrow (St_m, v)}$$
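The selection step of the filter rule can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the store is not threaded through the evaluation, and the predicate e'' is modelled as a hypothetical callable `predicate(x, k, m)` that returns a boolean or an integer; the name `filter_items` is likewise an assumption.

```python
from typing import Callable, List, Union


def filter_items(items: List[object],
                 predicate: Callable[[object, int, int], Union[bool, int]]) -> List[object]:
    """Keep the items whose predicate result is true, or is an integer equal to their position."""
    m = len(items)                                   # context size, bound to m
    kept = []
    for k, x in enumerate(items, start=1):           # k is the 1-based context position
        result = predicate(x, k, m)
        # bool must be tested first: in Python, bool is a subclass of int.
        if isinstance(result, bool):
            keep = result
        elif isinstance(result, int):
            keep = (result == k)
        else:
            raise TypeError("predicate must yield a boolean or an integer")
        if keep:
            kept.append(x)
    return kept
```

For example, a predicate that always returns 2 selects the second item, and one that returns `m` plays the role of `last()`.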

Path Expression (Rule [18]) The semantics of (e' / e") is as follows. First 
e' is evaluated. Then for each item in its result we bind in the environment x 
to this item, k to the position of x in the result of e' , and m to the number of 
items in the result of e' , and with this environment we evaluate e" . The results 
of all these evaluations are concatenated and finally this sequence is sorted by 
document order and the duplicates are removed. The result is only defined if all 
the evaluations of e" contain only nodes. 

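$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St_0, \langle x_1, \ldots, x_m \rangle) \qquad En' = En[\mathbf{m} \mapsto m] \\ St_0, En'[\mathbf{x} \mapsto x_1][\mathbf{k} \mapsto 1] \vdash e'' \Rightarrow (St_1, v_1) \quad \ldots \quad St_{m-1}, En'[\mathbf{x} \mapsto x_m][\mathbf{k} \mapsto m] \vdash e'' \Rightarrow (St_m, v_m) \\ v_1, \ldots, v_m \in \mathcal{V}^* \end{array}}{St, En \vdash e' / e'' \Rightarrow (St_m, Ord_{St_m}(\cup_{1 \le i \le m} Set(v_i)))}$$

$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St_0, \langle x_1, \ldots, x_m \rangle) \\ W_1 = \{x \in V_{St_0} \mid (x_1, x) \in (E_{St_0})^*\} \quad \ldots \quad W_m = \{x \in V_{St_0} \mid (x_m, x) \in (E_{St_0})^*\} \\ \langle x'_1, \ldots, x'_{m'} \rangle = Ord_{St_0}(\cup_{1 \le i \le m} W_i) \qquad En' = En[\mathbf{m} \mapsto m'] \\ St_0, En'[\mathbf{x} \mapsto x'_1][\mathbf{k} \mapsto 1] \vdash e'' \Rightarrow (St_1, v_1) \quad \ldots \quad St_{m'-1}, En'[\mathbf{x} \mapsto x'_{m'}][\mathbf{k} \mapsto m'] \vdash e'' \Rightarrow (St_{m'}, v_{m'}) \\ v_1, \ldots, v_{m'} \in \mathcal{V}^* \end{array}}{St, En \vdash e' // e'' \Rightarrow (St_{m'}, Ord_{St_{m'}}(\cup_{1 \le i \le m'} Set(v_i)))}$$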

Literal (Rule [19]) The result of a literal is simply a sequence with one element, 
viz., the atomic value the literal represents. 

Empty Sequence (Rule [20]) 

$$St, En \vdash () \Rightarrow (St, \langle\rangle)$$

Constructors (Rule [21]) Before we proceed with the presentation of the rule 
for the element constructor, we first introduce the notion of deep equality. This 
defines what it means for two nodes in an XML store to represent the same XML 
fragment. 

Definition 9 (Deep Equal). Given the XML store $St = (V, E, <, \nu, \sigma, \delta)$ and
two nodes $n_1$ and $n_2$ in $St$, $n_1$ and $n_2$ are said to be deep equal, denoted as
$DpEq_{St}(n_1, n_2)$, if $n_1$ and $n_2$ refer to two isomorphic trees, i.e., there is a
one-to-one function $h : C_{n_1} \to C_{n_2}$ with $C_{n_i} = \{n \mid (n_i, n) \in E^*\}$ for $i = 1, 2$,
such that for each $n, n' \in C_{n_1}$ it holds that (1) if $n \in \mathcal{V}^d$ ($\mathcal{V}^e$, $\mathcal{V}^a$, $\mathcal{V}^t$) then
$h(n) \in \mathcal{V}^d$ ($\mathcal{V}^e$, $\mathcal{V}^a$, $\mathcal{V}^t$), (2) if $\nu(n) = s$ then $\nu(h(n)) = s$, (3) if $\sigma(n) = s'$ then
$\sigma(h(n)) = s'$, (4) $(n, n') \in E$ iff $(h(n), h(n')) \in E$ and (5) if $n, n' \notin \mathcal{V}^a$ then
$n < n'$ iff $h(n) < h(n')$.
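As an illustration of this definition, the following Python sketch checks deep equality on a simplified tree representation. The `Node` class and the function name `deep_equal` are assumptions made for the example and are not part of LiXQuery. Attribute children are matched by name, since their relative order is not constrained by condition (5), while the remaining children are compared position by position.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                       # "document", "element", "attribute" or "text"
    name: Optional[str] = None      # node name, for element and attribute nodes
    value: Optional[str] = None     # string value, for attribute and text nodes
    children: List["Node"] = field(default_factory=list)


def deep_equal(n1: Node, n2: Node) -> bool:
    """Return True if the trees rooted at n1 and n2 are isomorphic in the sense sketched above."""
    if (n1.kind, n1.name, n1.value) != (n2.kind, n2.name, n2.value):
        return False
    attrs1 = [c for c in n1.children if c.kind == "attribute"]
    attrs2 = [c for c in n2.children if c.kind == "attribute"]
    rest1 = [c for c in n1.children if c.kind != "attribute"]
    rest2 = [c for c in n2.children if c.kind != "attribute"]
    if len(attrs1) != len(attrs2) or len(rest1) != len(rest2):
        return False
    # Attribute names are unique per parent, so matching by name yields a bijection.
    by_name = {a.name: a for a in attrs2}
    for a in attrs1:
        other = by_name.get(a.name)
        if other is None or not deep_equal(a, other):
            return False
    # Element and text children must correspond in sibling order.
    return all(deep_equal(c1, c2) for c1, c2 in zip(rest1, rest2))
```

Two elements that differ only in the order of their attributes are reported as deep equal, whereas swapping two element children is not.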

The semantics of the element constructor element{e'}{e''} is defined as
follows. First e' is evaluated and assumed to result in a single legal element
name. Then e'' is evaluated and for the result we create a new store $St_3$ that
contains the new element with the result of e' as its name and with contents
that are deep equal to the result of e'' if we compare them item by item.
Finally we add $St_3$ to the original store and return the newly created element
node.

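$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St_1, \langle s \rangle) \qquad s \in \mathcal{N} \qquad St_1, En \vdash e'' \Rightarrow (St_2, \langle n_1, \ldots, n_m \rangle) \\ n_1, \ldots, n_m \in \mathcal{V} \qquad St_4 = St_2 \cup St_3 \qquad n \in V_{St_3} \Rightarrow (r, n) \in E^*_{St_3} \qquad r \in \mathcal{V}^e \\ \nu_{St_3}(r) = s \qquad Ord_{St_3}(\{n' \mid (r, n') \in E_{St_3}\}) = \langle n'_1, \ldots, n'_m \rangle \qquad DpEq_{St_4}(n_1, n'_1) \;\ldots\; DpEq_{St_4}(n_m, n'_m) \\ \forall n, n' \in V_{St_2}((n \ll_{St_2} n') \Rightarrow (n \ll_{St_4} n')) \end{array}}{St, En \vdash \mathtt{element}\{e'\}\{e''\} \Rightarrow (St_4, \langle r \rangle)}$$

$$\frac{\begin{array}{c} St, En \vdash e' \Rightarrow (St_1, \langle s \rangle) \qquad s \in \mathcal{N} \qquad St_1, En \vdash e'' \Rightarrow (St_2, \langle s' \rangle) \qquad s' \in \mathcal{S} \\ St_4 = St_2 \cup St_3 \qquad V_{St_3} = \{r\} \qquad r \in \mathcal{V}^a \qquad \nu_{St_3}(r) = s \qquad \sigma_{St_3}(r) = s' \\ \forall n, n' \in V_{St_2}((n \ll_{St_2} n') \Rightarrow (n \ll_{St_4} n')) \end{array}}{St, En \vdash \mathtt{attribute}\{e'\}\{e''\} \Rightarrow (St_4, \langle r \rangle)}$$

$$\frac{\begin{array}{c} St, En \vdash e \Rightarrow (St_1, \langle s \rangle) \qquad s \in \mathcal{S} - \{\text{""}\} \qquad St_3 = St_1 \cup St_2 \qquad V_{St_2} = \{r\} \\ r \in \mathcal{V}^t \qquad \sigma_{St_2}(r) = s \qquad \forall n, n' \in V_{St_1}((n \ll_{St_1} n') \Rightarrow (n \ll_{St_3} n')) \end{array}}{St, En \vdash \mathtt{text}\{e\} \Rightarrow (St_3, \langle r \rangle)}$$

$$\frac{\begin{array}{c} St, En \vdash e \Rightarrow (St_1, \langle n_1 \rangle) \qquad n_1 \in \mathcal{V}^e \qquad St_3 = St_1 \cup St_2 \qquad n \in V_{St_2} \Rightarrow (r, n) \in E^*_{St_2} \qquad r \in \mathcal{V}^d \\ (r, n_2) \in E_{St_2} \qquad DpEq_{St_3}(n_1, n_2) \qquad \forall n, n' \in V_{St_1}((n \ll_{St_1} n') \Rightarrow (n \ll_{St_3} n')) \end{array}}{St, En \vdash \mathtt{document}\{e\} \Rightarrow (St_3, \langle r \rangle)}$$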

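Typeswitch-expression (Rules [22] and [23]) Let $[\![\mathtt{xs{:}boolean}]\!] =
\mathcal{B}$, $[\![\mathtt{xs{:}integer}]\!] = \mathcal{I}$, $[\![\mathtt{xs{:}string}]\!] = \mathcal{S}$, $[\![\mathtt{document\text{-}node()}]\!] = \mathcal{V}^d$,
$[\![\mathtt{attribute()}]\!] = \mathcal{V}^a$, $[\![\mathtt{text()}]\!] = \mathcal{V}^t$ and $[\![\mathtt{element()}]\!] = \mathcal{V}^e$.

$$\frac{\begin{array}{c} St, En \vdash e \Rightarrow (St_1, \langle x \rangle) \\ (x \in [\![t_j]\!] \lor j = m + 1) \qquad \forall_{1 \le i < j}(x \notin [\![t_i]\!]) \qquad St_1, En \vdash e_j \Rightarrow (St_2, v) \end{array}}{\begin{array}{c} St, En \vdash \mathtt{typeswitch}(e)\ \mathtt{case}\ t_1\ \mathtt{return}\ e_1\ \ldots\ \mathtt{case}\ t_m\ \mathtt{return}\ e_m \\ \mathtt{default\ return}\ e_{m+1} \Rightarrow (St_2, v) \end{array}}$$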

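Function Call (Rule [24]) The semantics of $f(e_1, \ldots, e_m)$ is that $e_1, \ldots, e_m$
are consecutively evaluated, and then the expression $\mathbf{b}(f)$ is evaluated with the
variable names of $\mathbf{a}(f)$ bound to the results of $e_1, \ldots, e_m$.

$$\frac{\begin{array}{c} St, En \vdash e_1 \Rightarrow (St_1, v_1) \quad \ldots \quad St_{m-1}, En \vdash e_m \Rightarrow (St_m, v_m) \qquad \mathbf{a}_{En}(f) = \langle s_1, \ldots, s_m \rangle \\ En' = En[\mathbf{v}(s_1) \mapsto v_1] \ldots [\mathbf{v}(s_m) \mapsto v_m][\mathbf{x} \mapsto \bot, \mathbf{k} \mapsto \bot, \mathbf{m} \mapsto \bot] \\ St_m, En' \vdash \mathbf{b}_{En}(f) \Rightarrow (St', v') \end{array}}{St, En \vdash f(e_1, \ldots, e_m) \Rightarrow (St', v')}$$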
7 Conclusion

In this paper we have presented a fragment of XQuery called LiXQuery together 
with a formal and concise but complete description of its semantics that is consis- 
tent with the formal semantics of XQuery. We claim that this fragment captures 
the essence of XQuery as a query language and can therefore be used for educa- 
tional purposes, e.g., teaching XQuery, and research purposes, e.g., investigating 
the expressive power of XQuery fragments and query optimization in XQuery 
implementations. 
Abstract. Establishing the hierarchical order among XML elements is
an essential function of XML query processing techniques. Although most 
XML documents have an associated DTD or XML schema, the doc- 
ument structure information has not been utilized efficiently in query 
processing techniques proposed so far. In this paper, we propose a novel 
technique that uses DTD or XML schema to improve the disk I/O com- 
plexity of XML query processing. We present a schema-based numbering 
scheme called SPIDER that incorporates both structure information and 
tag names extracted from the document structure descriptions. Given 
the tag name and the identifier of an element, SPIDER can determine the 
tag names and the identifiers of the ancestor elements without disk I/O. 
Based on SPIDER, we designed a mechanism called VirtualJoin that sig- 
nificantly reduces disk I/O workload for processing XML queries. Our 
experiments indicated that SPIDER outperforms the structural join tech- 
niques Stack-Tree and PathStack in XML query processing, especially for 
XML queries with heavy join workload and large data sets. 



1 Introduction 

Since the structure of XML [16] data can be represented by a rooted labeled tree,
queries about XML data typically specify elements by selection predicates and
their tree structural relationship. Among a number of methods proposed for
verifying the structural relationship of XML elements, applying a numbering 
scheme is a powerful technique. Recent approaches [9], [10], [11], [12], [13] to 
the problem combined a numbering scheme, where each element is represented 
by a 3-tuple (startPos, endPos, level), with the notion of structural joins, 
which select the pairs of XML elements from candidate sets such that a given 
hierarchical order holds. The structural join approach has attracted many studies 
because it facilitates queries about element type, order, and predicate, which are 
important components of XPath expressions. 

The order of elements must obey the schematic information described in 
the grammar, called the DTD or XML schema, which describes the structure 
of the class that the XML document belongs to. Since document structure
descriptions let XML documents exchanged over the Web share their grammars
efficiently, most of the XML documents in use are associated with a DTD
or XML schema. However, previous XML query processing techniques have not 
utilized the schematic information from DTD or XML schema efficiently. 

A drawback of existing structural join techniques is that they require index 
data about all elements participating in a query to be loaded. In this paper, we 
propose a novel mechanism for XML query processing called VirtualJoin that 
utilizes the schematic information extracted from the DTD and XML schema 
to improve the disk I/O complexity of XML query processing. The core of the 
mechanism is a schema-based numbering scheme called SPIDER (for Schema- 
based Path IDentifiER) that incorporates both structure and tag name informa- 
tion about XML elements. When SPIDER is used, if the identifier and tag name 
of an element are known, the identifier and tag name of the parent element, 
as well as those of all of its ancestor elements, can be determined without any 
disk I/O. Based on this property, VirtualJoin reduces the number of elements
whose index data is required for processing a query.

To the best of our knowledge, SPIDER is the first numbering scheme that 
utilizes DTD and XML schema. Furthermore, SPIDER expresses the structure 
of XML documents by the identifiers of paths existing in the XML documents, 
rather than by the node identifiers used in previous numbering schemes. 



2 Preliminaries 



A DTD or an XML schema contains the description of the hierarchical rela- 
tionship of XML elements. For example, a DTD may have the element type 
declarations shown in Figure 1(a). The element hierarchy expressed by these 
element type declarations is depicted in Figure 1(b). 



<!ELEMENT personnel (company, business, person+)>
<!ELEMENT person (name, email?, person*)>
<!ELEMENT name (family, given)>

(a) Element type declarations 



personnel
    company
    business
    person +
        name
            family
            given
        email
        person *



(b) Element hierarchy 



Fig. 1. Example of DTD. 

According to these declarations, any element company has a parent element 
personnel. The structure of an XML document is relatively simple if the tag 
name of the parent element of any element in the document can be determined 
uniquely. However, in our example, the condition is not true with an element 
person since its parent element may be either an element personnel or an 
element person. The multiple choice makes the structure of XML documents 
more complex. For simplicity, we use the notation of DTD in the rest of this
paper. 

3 SPIDER: A Schema-Based Numbering Scheme 

In this section, we describe the numbering scheme SPIDER, which embeds infor- 
mation about both the identifiers and tag names of XML elements. 



3.1 Design of SPIDER 

SPIDER utilizes the structure information described by pairs of parent-child tag
names in the DTD. Intuitively, each pair of tag names of parent and child nodes
in the DTD is mapped to an integer called the parent-child tag indicator, PCTAGID.
The mapping is designed so that the tag name of a child node and the corresponding
PCTAGID uniquely determine the tag name of the parent node. In
other words, the following dependencies, called structural dependencies of the
DTD, hold:

parent_tag, child_tag → PCTAGID (1)

child_tag, PCTAGID → parent_tag (2)

Note that there are many mappings (1) such that dependency (2) holds.



Representation of the DTD. The parent-child relationship in a DTD is represented
by a table StruDTD that has three columns PARTAG, CHITAG, and PCTAGID
containing the tag names of the parent elements, the tag names of the
child elements, and the corresponding parent-child tag indicator, respectively. In
terms of the relational data model, dependencies (1) and (2) mean that PCTAGID and
PARTAG, respectively, functionally depend on the other two columns. Given a
table StruDTD such that dependencies (1) and (2) hold, let mapping (1) be denoted
by the function findPCTAGID and mapping (2) be denoted by the function
parentTAG. The maximum value in the column PCTAGID is called the fanout of
StruDTD. Note that the size of StruDTD is normally small enough to be kept in
main memory.
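As a rough illustration, StruDTD can be held in memory as two dictionaries, one per dependency. The sketch below is an assumption about one possible representation, not the authors' implementation; the names `build_maps` and `TABLE_1` are hypothetical, and the rows are those of Table 1.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, int]  # (PARTAG, CHITAG, PCTAGID)

# Rows of Table 1.
TABLE_1 = [
    ("personnel", "company", 1), ("personnel", "business", 2), ("personnel", "person", 3),
    ("person", "name", 2), ("person", "email", 3), ("person", "person", 4),
    ("name", "family", 1), ("name", "given", 2),
]


def build_maps(rows: Iterable[Row]):
    """Build findPCTAGID and parentTAG as dictionaries and check dependencies (1) and (2)."""
    find_pctagid: Dict[Tuple[str, str], int] = {}
    parent_tag: Dict[Tuple[str, int], str] = {}
    for partag, chitag, pctagid in rows:
        # Dependency (1): (parent_tag, child_tag) determines PCTAGID.
        if find_pctagid.setdefault((partag, chitag), pctagid) != pctagid:
            raise ValueError(f"dependency (1) violated for {(partag, chitag)}")
        # Dependency (2): (child_tag, PCTAGID) determines parent_tag.
        if parent_tag.setdefault((chitag, pctagid), partag) != partag:
            raise ValueError(f"dependency (2) violated for {(chitag, pctagid)}")
    fanout = max(find_pctagid.values())
    return find_pctagid, parent_tag, fanout


FIND_PCTAGID, PARENT_TAG, FANOUT = build_maps(TABLE_1)
```

Both lookups are then constant-time dictionary accesses, which is one way to exploit the fact that the table fits in main memory.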

The table StruDTD is constructed from the DTD or XML schema. The flexi- 
bility in design enables different variants of the table StruDTD from a given DTD 
or XML schema. Therefore, we can generate different variants of the SPIDER 
numbering scheme, depending on particular needs. 



3.2 Algorithm for Constructing SPIDER 

Given a table StruDTD such that dependencies (1) and (2) hold, let f denote the
fanout of StruDTD. Let T denote an XML tree rooted at r. If n is a node of T then
the parent node, the tag name, and the identifier of n are denoted by parent(n),
n.tag, and n.sid, respectively, where sid means “structural identifier”. SPIDER
generates the identifiers of the nodes in an XML tree in a depth-first traversal
as shown in Algorithm ConstructSPIDER.

Example 1. An example of the table StruDTD corresponding to the element
hierarchy in Figure 1, such that dependencies (1) and (2) hold, is shown in
Table 1. Note that the pairs (personnel, person) and (person, person) have
different PCTAGIDs. Figure 2 shows an example of SPIDER that uses the table
StruDTD, where the identifier of a node is larger than the identifier of any
of its left siblings.

Table 1. A table StruDTD.

PARTAG      CHITAG      PCTAGID
personnel   company     1
personnel   business    2
personnel   person      3
person      name        2
person      email       3
person      person      4
name        family      1
name        given       2



Algorithm: ConstructSPIDER

Input: T rooted at r, StruDTD, f = max(findPCTAGID())

Output: SPIDER of nodes in T

1. depth-first travel T
2.   if n is r
3.     n.sid ← 1;
4.   else p ← parent(n);
5.     pctagid ← findPCTAGID(p.tag, n.tag);
6.     n.sid ← f * (p.sid - 1) + 1 + pctagid;
     endif
   endtravel

The identifier of a child node of p is the sum of a basic value computed from 
the identifier of p plus the parent-child tag indicator of the child node and p 
(steps 5 and 6). Therefore, the structure information of DTD, expressed by the 
table StruDTD, is incorporated into the identifiers of nodes in XML trees. 
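The algorithm can be sketched in Python as below. This is a minimal illustration under stated assumptions: the `Node` class, the function name `construct_spider`, and the small example tree are hypothetical, and findPCTAGID is taken from Table 1 as a dictionary. Any traversal that visits a parent before its children yields the same identifiers, so an explicit stack is used here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    sid: int = 0


# findPCTAGID according to Table 1; its maximum value (the fanout f) is 4.
FIND_PCTAGID: Dict[Tuple[str, str], int] = {
    ("personnel", "company"): 1, ("personnel", "business"): 2, ("personnel", "person"): 3,
    ("person", "name"): 2, ("person", "email"): 3, ("person", "person"): 4,
    ("name", "family"): 1, ("name", "given"): 2,
}


def construct_spider(root: Node, find_pctagid: Dict[Tuple[str, str], int]) -> None:
    """Assign n.sid = f * (p.sid - 1) + 1 + pctagid to every non-root node; the root gets 1."""
    f = max(find_pctagid.values())
    root.sid = 1
    stack = [root]
    while stack:
        p = stack.pop()
        for n in p.children:
            n.sid = f * (p.sid - 1) + 1 + find_pctagid[(p.tag, n.tag)]
            stack.append(n)


# A hypothetical document that conforms to the DTD of Figure 1.
inner_email = Node("email")
inner_person = Node("person", [Node("name"), inner_email])
outer_name = Node("name", [Node("family"), Node("given")])
outer_person = Node("person", [outer_name, Node("email"), inner_person])
tree = Node("personnel", [Node("company"), Node("business"), outer_person])
construct_spider(tree, FIND_PCTAGID)
```

In this tree the outer person receives identifier 4 and its name child receives 4 × (4 − 1) + 1 + 2 = 15.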

Example 2. Given a StruDTD in Table 1, suppose we have to generate the identifier
of a node name that is the first child of a node person, whose identifier is
equal to 4. Since ConstructSPIDER uses a depth-first traversal, the identifier of
the person is already known. According to Table 1, findPCTAGID(person, name)
is equal to 2, and max(findPCTAGID()) is equal to 4. Therefore, the identifier of
the name is 4 × (4 − 1) + 1 + 2, which is equal to 15.

Coding complexity. The cost of the function findPCTAGID is equal to
log2(size(StruDTD)). Therefore, the cost of Algorithm ConstructSPIDER is equal
to O(log2(size(StruDTD)) × (number of nodes)).

3.3 Index Functions for SPIDER 

Fig. 2. Example of SPIDER

We used the function findPCTAGID to embed the structure information into
identifiers of nodes of an XML tree. For query processing, we need two index
functions parentSID and nameID as a means to restore the hierarchy of XML
nodes based on their identifiers. Given a node n, the index function parentSID
for computing the structural identifier of the parent node of n is defined as
follows:

$$\mathrm{parentSID}(n.sid) = \lfloor (n.sid - 2) / f \rfloor + 1 \qquad (3)$$

This function is similar to the one introduced in [8]. However, it is worth
emphasizing that we apply the function to a numbering scheme that is more
comprehensive and more flexible than the scheme introduced in [8].

The second index function nameID for computing the PCTAGID of n in its parent
node is defined as follows:

$$\mathrm{nameID}(n.sid) = n.sid - f \times \lfloor (n.sid - 2) / f \rfloor - 1 \qquad (4)$$

Lemma 1. Given an XML tree and a table StruDTD such that dependencies (1) 
and (2) hold, if the structural identifiers are generated by Algorithm Construct- 
SPIDER then for a given node, the structural identifier and the tag name of its 
parent node can be determined. 

The proof of Lemma 1 can be found in [20]. Given a node, it is possible to 
determine the tag names of all ancestor nodes of a node without any disk I/O 
by applying Lemma 1 recursively. As shown in Section 5, this property is used 
to avoid the intermediate structural joins and reduce the disk I/O workload for 
index data necessary for evaluating an XML query. 

Example 3. Given an XML tree in Figure 2, let us consider the node email
having the identifier equal to 68. Since f = 4, the function parentSID(68) is
⌊(68 − 2) / 4⌋ + 1, which is equal to 17. The function nameID(68) is 68 − 4 ×
⌊(68 − 2) / 4⌋ − 1, which is equal to 3. The function parentTAG(‘email’, 3) returns
‘person’, according to Table 1. Therefore, the parent node is person and has the
identifier equal to 17.
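Applying the two index functions repeatedly recovers the whole ancestor path of a node from its tag name and identifier alone. The following Python sketch illustrates this; the function names `parent_sid`, `name_id` and `ancestor_path` are hypothetical, parentTAG is taken from Table 1 as a dictionary, and f = 4 as in the example.

```python
from typing import Dict, List, Tuple

# parentTAG according to Table 1: (child tag, PCTAGID) -> parent tag.
PARENT_TAG: Dict[Tuple[str, int], str] = {
    ("company", 1): "personnel", ("business", 2): "personnel", ("person", 3): "personnel",
    ("name", 2): "person", ("email", 3): "person", ("person", 4): "person",
    ("family", 1): "name", ("given", 2): "name",
}
F = 4  # fanout of Table 1


def parent_sid(sid: int, f: int = F) -> int:
    """Equation (3): identifier of the parent node."""
    return (sid - 2) // f + 1


def name_id(sid: int, f: int = F) -> int:
    """Equation (4): PCTAGID of the node within its parent."""
    return sid - f * ((sid - 2) // f) - 1


def ancestor_path(tag: str, sid: int) -> List[Tuple[str, int]]:
    """Return (tag, sid) pairs from the node up to the root, using only in-memory lookups."""
    path = [(tag, sid)]
    while sid != 1:  # the root always has identifier 1
        tag = PARENT_TAG[(tag, name_id(sid))]
        sid = parent_sid(sid)
        path.append((tag, sid))
    return path
```

For the email node of Example 3, `ancestor_path("email", 68)` walks through person (17) and person (4) up to personnel (1).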
4 Variants of SPIDER

As discussed in Section 3, since there are various ways to construct a table 
StruDTD such that dependencies (1) and (2) hold, we have different variants of 
SPIDER for a given class of XML documents. In this section, we present two 
coding variants for two specific purposes. 



4.1 Unordered SPIDER 

The unordered variant of SPIDER (or USPIDER for short) has the smallest coding
size, where the parent-child relationship is emphasized. For example, the elements
that correspond to the first b and the second b in the DTD declaration
<!ELEMENT a (b?, c*, d, b)> are treated equally as unordered children of an
element a.

We use the following observation: for a given tag name b in a DTD, if there
is only one tag a such that b appears in the element type declaration of a then
dependency (2) holds whatever the value of PCTAGID is. In other words, dependency
(2) is reduced to child_tag → parent_tag. For such a pair (a, b), we can
assign any number, e.g. zero or one, to PCTAGID to make the coding small.



Construction of the function findPCTAGID for USPIDER. Initially, the rows
in the table StruDTD are generated corresponding to pairs of parent-child tags in
the DTD and all the PCTAGID values are set to one. In order to make dependency
(2) hold, we have to stretch the column PCTAGID of the table. If there exist
a1, a2, and b such that a1 ≠ a2 but findPCTAGID(a1, b) = findPCTAGID(a2, b),
then set findPCTAGID(a2, b) = max_{i ≠ 2}(findPCTAGID(ai, b)) + 1. It is obvious
that findPCTAGID(a1, b) ≠ findPCTAGID(a2, b) afterwards. Since the cardinalities of XML
DTDs are finite, the process always stops.
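The stretching procedure for USPIDER can be sketched in Python as follows. The function name `uspider_pctagids` and the list `PAIRS` are assumptions for illustration; the pairs are the parent-child pairs of the DTD in Figure 1, and which of two clashing parents is renumbered depends on the iteration order chosen here.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (parent tag, child tag)

# Parent-child pairs of the DTD in Figure 1.
PAIRS: List[Pair] = [
    ("personnel", "company"), ("personnel", "business"), ("personnel", "person"),
    ("person", "name"), ("person", "email"), ("person", "person"),
    ("name", "family"), ("name", "given"),
]


def uspider_pctagids(pairs: List[Pair]) -> Dict[Pair, int]:
    """Start with PCTAGID 1 everywhere and renumber until dependency (2) holds."""
    find = {pair: 1 for pair in pairs}
    while True:
        # Look for two different parents a1, a2 of the same child b with equal PCTAGID.
        clash = next(
            ((a2, b) for (a1, b) in pairs for (a2, b2) in pairs
             if b2 == b and a1 != a2 and find[(a1, b)] == find[(a2, b)]),
            None,
        )
        if clash is None:
            return find
        a2, b = clash
        # Move (a2, b) above every other PCTAGID used for child b.
        find[clash] = max(find[(a, c)] for (a, c) in pairs if c == b and a != a2) + 1


USPIDER = uspider_pctagids(PAIRS)
```

For this DTD only the child tag person has two possible parents, so a single renumbering suffices and the resulting fanout is 2.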

4.2 Sibling-Order SPIDER 

This variant of SPIDER, called OSPIDER, is sibling-order sensitive. OSPIDER
guarantees the order of sibling nodes, i.e., the identifiers of the child nodes of a node
reflect the order in which these nodes appear in the parent node.



Transformation of complex DTD declarations to the basic form. In
general, a DTD can be complex because of the complex specification of the type
of an element. For example, an element a can be defined by a DTD declaration
such as <!ELEMENT a ((b, c, d)*, (e, c)*)>. Theoretically, it is possible to
ask for the fifth element c in an element a by specifying a predicate c[5] in an XPath
expression. However, such a reference is rare because although the elements c in
(b, c, d)* and (e, c)* have the same tag name, they are located in different
parts of the element a, so it is highly likely that these elements c have different
semantics. Therefore, we decompose such a declaration using intermediate elements
to reduce the complexity. The above example declaration can be represented by an equivalent
group of DTD declarations <!ELEMENT g (b, c, d)>, <!ELEMENT h (e, c)>,
and <!ELEMENT a (g*, h*)>. The primary decomposition rules of DTD declarations
are:

1. (b|c)* is decomposed into d* and d = (b|c) 

2. (b,c)* is decomposed into d* and d = (b, c) 

3. b*|c* is decomposed into d|e and d = b*, e = c* 

After the transformation, all declarations can be expressed in a basic form
<!ELEMENT a ($b_1\sigma_1, b_2\sigma_2, \ldots, b_k\sigma_k$)>, where a and $b_i$ (i = 1..k) are tag
names and $\sigma_i$ is either empty or one of the cardinalities ‘?’, ‘*’, and ‘+’.

Construction of the findPCTAGID function. Since it is desirable to reduce 
the size of the identifiers and make them robust on node insertion, we group the 
child nodes of a node in clusters and assign the same identifier to the nodes in 
a cluster. 

Definition 1. Among the child nodes of a given node, a c-group is the maximal 
group of consecutive sibling nodes having the same tag name. 

We assign an identifier to each c-group of child nodes. To determine the
identifiers, we reduce the corresponding DTD declaration to its core form by
removing the cardinalities of sub-elements from the DTD declaration. For example,
<!ELEMENT a (b, c, d)> is the core form of the DTD declaration
<!ELEMENT a (b, c?, d*)>. The tag names appearing in the core form of a
DTD declaration are numbered by integers starting from one. The index of a
c-group is the order of the corresponding tag name in the core form of the DTD
declaration of the parent element.

Definition 2. The ‘core pctagid ’ of the node n in its parent node p is equal 
to the index of the c-group containing n in p . 
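The notions of c-group, core form and core PCTAGID can be illustrated with a short Python sketch. The helper names `c_groups`, `core_form` and `core_pctagids` are hypothetical, and the sketch assumes a declaration already in basic form whose content model is given as a comma-separated string in which every tag name occurs once.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def c_groups(child_tags: List[str]) -> List[Tuple[str, int]]:
    """Split the children of a node into maximal runs of consecutive siblings with the same tag."""
    return [(tag, len(list(run))) for tag, run in groupby(child_tags)]


def core_form(content_model: str) -> List[str]:
    """Drop the cardinalities '?', '*' and '+' from a basic-form content model."""
    return [part.strip().rstrip("?*+") for part in content_model.split(",")]


def core_pctagids(content_model: str) -> Dict[str, int]:
    """Number the tag names of the core form starting from one."""
    return {tag: index for index, tag in enumerate(core_form(content_model), start=1)}
```

For the declaration `<!ELEMENT a (b, c?, d*)>`, `core_pctagids("b, c?, d*")` numbers b, c and d as 1, 2 and 3, so all nodes d of one parent fall into a single c-group with core PCTAGID 3 regardless of how many there are.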

Each XML node is assigned a ‘core PCTAGID’ that is the index of its c-group.
Note that the core PCTAGID is different from the actual index of child nodes
in their parent node. To determine the actual PCTAGID, we stretch the core
PCTAGID to make dependency (2) hold. Initially, the rows in the table StruDTD
are generated corresponding to pairs of parent-child tags in the DTD and all
the PCTAGID values are set to the core PCTAGID. Since we want to maintain
the order of c-groups as it is in the DTD, if there exist a1, a2, and b such that a1 ≠ a2 but
findPCTAGID(a1, b) = findPCTAGID(a2, b), then an integer value is added to
the core PCTAGID of all of the child nodes of a2 such that findPCTAGID(a1, b)
≠ findPCTAGID(a2, b). Since the cardinality of any DTD is finite, the stretching
procedure always stops after a number of steps.

The StruDTD shown in Table 1 in Section 3.1 is of a variant of OSPIDER. 
Since the core pctagid of the pairs (personnel, person) and (person, person) 
are equal, i.e. three, the core pctagid of all pairs (person, *) were adjusted by 
being increased by one. After that, the pctagid of (person, person) is equal to four. 

4.3 Robustness of SPIDER 

SPIDER is robust on structural update since, taking into account the element 
descriptions from DTD or XML schema, it anticipates the positions of updated 
nodes in the associated XML documents. Let us consider two kinds of structural 
updates as follows: 

1. An uncertain occurrence specified by the cardinality ‘?’ of an ele- 
ment in its parent element. In the construction of SPIDER, an identifier for 
such an element is reserved. 

2. The insertion of a node in a group of nodes specified by the cardinalities 
‘*’ or ‘+’ in the declaration of a parent node. For example, 
according to the DTD declaration <!ELEMENT a (b?, c*, d)>, a node a 
may have a number of nodes c. Where SPIDER is used, all these nodes c 
have the same sid, so the insertion of a new node c does not affect the 
identifiers of the others. 



4.4 Coding Size 

The interesting feature of SPIDER is that the fanout of findPCTAGID used to 
compute sid does not depend on the maximal degree of the nodes in the actual 
XML trees. In our experiments, even when the DTD was complex, the fanout of 
OSPIDER is relatively small. Moreover, the coding size does not depend on the 
actual size of tags of elements and attributes. 



5 Virtual Joins with SPIDER 

In this section we discuss an application of SPIDER in XML query processing. 
We represent each XML element by a pair [sid, ord] , where sid is the identifier 
generated by SPIDER and ord is an actual order representation of the node in an 
XML instance. Among several representations available for this purpose, we use a 
3-tuple (docID, startPos, endPos) to express ord, where docID, startPos, and 
endPos have natural meanings. Using this representation, the element (docID, 
$startPos_1$, $endPos_1$) is an ancestor of the element (docID, $startPos_2$, $endPos_2$) 
iff $startPos_1 < startPos_2$ and $endPos_2 < endPos_1$. 
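
The containment test can be illustrated with a minimal sketch. The class name Ord, the function name is_ancestor, and the sample positions are assumptions made for illustration only.

```python
from typing import NamedTuple

class Ord(NamedTuple):
    doc_id: int
    start_pos: int
    end_pos: int

def is_ancestor(a: Ord, d: Ord) -> bool:
    """True if a strictly contains d within the same document."""
    return (a.doc_id == d.doc_id
            and a.start_pos < d.start_pos
            and d.end_pos < a.end_pos)

person = Ord(doc_id=1, start_pos=10, end_pos=50)
name = Ord(doc_id=1, start_pos=12, end_pos=20)
print(is_ancestor(person, name))  # True
print(is_ancestor(name, person))  # False
```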

As will be shown later, sid and ord complement each other in query pro- 
cessing. Although the index data of each element or attribute is a combination 
of both sid and ord, the actual disk I/O workload that must be loaded for 
processing a query is small since SPIDER provides a mechanism avoiding many 
intermediate structural joins in XML query processing. We call the mechanism 
VirtualJoin. We will describe the application of VirtualJoin to the process- 
ing of queries of the basic path-predicate, complex path-predicate, and twig types. 
First, we briefly describe several basic functions used by VirtualJoin. Detailed 
descriptions of the functions can be found in [20]. 

findParentSidAndTag determines the tag name and sid of the parent node of 
a given node. The function calls the functions parentSID and nameID to 
compute the sid of the parent node and the pctagid of the pair of these parent-child 
nodes. This computed pctagid is used by the function parentTAG to 
find the tag name of the parent node. 

generateNodePath establishes the full node path for a given node by recursively 
calling the function findParentSidAndTag. 

findAncByName checks the existence of ancestors having a specific tag name of 
a given node specified by its sid and tag name. If such ancestors exist, the 
function returns the lowest ancestor. 



5.1 Basic Path-Predicate Queries 

A simple XML query can be represented by a path expression ending with a 
predicate. We call it a basic path-predicate query. 

Definition 3. An XML query is called ‘a basic path-predicate’ query if it is 
expressed in the form $a_1 \ell_1 a_2 \ell_2 \cdots \ell_{k-1} a_k$ or $a_1 \ell_1 a_2 \ell_2 \cdots \ell_{k-1} a_k [V]$, where k > 1, $\ell_i$ 
(i = 1 to k-1) is either the child axis ‘/’ or the self-or-descendant axis ‘//’, and 
V is a predicate of $a_k$. 

The predicate V is optional. An example of a basic path-predicate query is 
“person/name/[given = ‘Smith’]”. Using the Structural Join approach, in the 
basic path-predicate queries, all the nodes having the tag names $a_i$, i < k, and 
the nodes having the tag name $a_k$ filtered by the predicate V participate in the 
structural joins. Therefore, the index data of all these nodes must be loaded into 
main memory. 

Using SPIDER, for a given node, it is possible to establish its full node path 
using the function generateNodePath, and hence test if the node path matches 
the path $a_1 \ell_1 a_2 \ell_2 \cdots \ell_{k-1} a_k$. For a basic path-predicate query, VirtualJoin performs 
a single scan^ over the set of nodes $a_k$ satisfying the predicate V. For each 
node of the set, it tests whether the node path of $a_k$ matches the path $a_1 \ell_1 a_2 \ell_2 \cdots \ell_{k-1} a_k$. Note 
that the test may terminate early for disqualified nodes and that the testing process 
runs entirely in main memory. VirtualJoin requires only the index data of the nodes $a_k$ 
that satisfy the predicate V to be loaded into main memory. VirtualJoin can 
process the ancestor-level joins, where the hierarchy level is required, such as 
“a/b”, “a/*/b”, as well as the “a//b” join. 

Whether the node path of a node $a_k$ matches the path 
$a_1 \ell_1 a_2 \ell_2 \cdots \ell_{k-1} a_k$ can be decided using techniques for the pattern matching problem [21]. However, considering 
that the height of an XML tree is normally low, we proposed a simple 
algorithm, the details of which can be found in [20]. 
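
As a rough illustration of such a test (and not the algorithm of [20], whose details are not reproduced here), the following sketch matches a root-to-node tag path against a pattern by walking backwards from the node. The function name matches, the list-based encoding of the pattern, and the sample node path are hypothetical. Two simplifying assumptions are made: the first tag of the pattern may occur at any depth, and ‘//’ is treated as a proper ancestor-descendant step.

```python
def matches(node_path, tags, axes):
    """Check whether node_path (root-to-node tag list) ends with a match of the pattern.

    tags = [a1, ..., ak]; axes = [l1, ..., l(k-1)], each '/' or '//'.
    The last tag must match the node itself.
    """
    def match_from(p, t):
        # node_path[p] has already been matched to tags[t].
        if t == 0:
            return True
        if axes[t - 1] == "/":
            candidates = [p - 1] if p - 1 >= 0 else []
        else:  # '//': any proper ancestor position may match the previous tag
            candidates = range(p - 1, -1, -1)
        return any(node_path[q] == tags[t - 1] and match_from(q, t - 1)
                   for q in candidates)

    if not node_path or node_path[-1] != tags[-1]:
        return False  # early termination for disqualified nodes
    return match_from(len(node_path) - 1, len(tags) - 1)

path = ["site", "open_auctions", "open_auction", "annotation",
        "description", "parlist", "listitem"]
print(matches(path, ["open_auction", "description", "listitem"], ["//", "//"]))  # True
print(matches(path, ["open_auction", "listitem"], ["/"]))                        # False
```

Because the node path is short for typical XML trees, the backtracking over ‘//’ steps stays cheap in this sketch.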

Note that VirtualJoin does not require the candidate nodes to be sorted 
and it evaluates basic path-predicate queries without disk I/O except for the 
index data of the output candidate set. 

^ For predicates with a low selectivity, there have been other approaches with a cost 
of an extra index. 

5.2 Complex Path-Predicate Queries 

A path query may contain a number of selection predicates. 

Definition 4. A query is called ‘a complex path-predicate query’ if it is expressed 
in the form of a finite sequence of basic path-predicate queries separated by the 
child axis ‘/’ or the self-or-descendant axis ‘//’. 

A complex path-predicate query $B_1 \ell_1 B_2 \ell_2 \cdots \ell_{k-1} B_k$, k > 1, is evaluated by 
integrating the results of the basic path-predicate queries $B_i$, i = 1 to k, which 
are evaluated separately using VirtualJoin. Let $R_i = \{r^i_j\}$ denote the list of 
result nodes of $B_i$, and $S_i = \{s^i_j\}$ denote the list of sids of the nodes in the highest 
hierarchical structure of $B_i$ corresponding to $\{r^i_j\}$. From VirtualJoin of basic 
path-predicate queries, it is obvious that $\{s^i_j\}$ is generated together with $\{r^i_j\}$. 
The lists $R_i$ are joined using the conventional structural join technique to produce 
the final result. Two elements $r^i_j$ and $r^{i+1}_l$, $1 \le i \le k-1$, are matched in the 
structural joins of $R_i$ and $R_{i+1}$ if both of the following conditions hold: 

1. $r^i_j$.startPos < $r^{i+1}_l$.startPos && $r^{i+1}_l$.endPos < $r^i_j$.endPos 

2. ($\ell_i$ is ‘/’ && $r^i_j$.sid = parentSID($s^{i+1}_l$)) || ($\ell_i$ is ‘//’ && $r^i_j$.sid < 

5.3 Twig Queries 

Queries about XML data typically specify elements by selection predicates and 
their tree-structured relationships, which can be represented as a node-labeled twig 
pattern with or without predicates in the leaf nodes. VirtualJoin processes a 
twig query by decomposing it into subqueries (see Figure 3) that can be evaluated 
as complex path-predicate queries. The result elements of these queries are then 
joined in the same way as the intermediate results in Section 5.2. 

Example 4. To be processed using VirtualJoin, the XQuery statement 

FOR $b IN /site/people/person 
WHERE $b/address/[city = 'Nara'] 
RETURN $b/name/text 

is decomposed into three path-predicate queries “/site/people/person”, 
“address/[city = 'Nara']”, and “name/text”. 



In general, the disk I/O complexity of VirtualJoin is optimal since only the 
index data of elements that have to be verified by predicates or belong to the 
candidate set of the output is loaded. In Example 4, the index data of person, 
city, and text is needed to perform the joins. 



Fig. 3. Processing a twig query 




XML Query Processing Using a Schema-Based Numbering Scheme 



6 Experiments 

We implemented VirtualJoin to test the efficiency of this method. The main 
file structure for storing the index data is as follows: 

<scx, element_name, add_infor> 

where scx consists of DSPIDER and ord, element_name is the tag name, and 
add_infor is additional information including an element or attribute indicator 
and a pointer to data. The data is indexed on scx and element_name using 
a B+-tree sorted by scx.ord.startPos. 

We compared our method with the methods Stack-Tree-Desc in [12] and PathStack 
in [13]. Both algorithms use the tuple (docID, startPos, endPos, 
nodeLevel) to represent an XML element. Stack-Tree-Desc has the highest performance 
among the four methods described in [12], where stacks are used to 
reduce the number of match tests for the pairs of elements from the joined candidate 
sets in structural joins. PathStack also uses stacks to compactly represent the 
partial and total answers to avoid large intermediate answers. For comparison, 
we measured the total time for XML query processing, which includes the elapsed 
times for loading the index data and for performing structural joins. 

6.1 Experimental Platform and Data Sets 

Our experiments were coded in Java and conducted on a workstation running 
Windows XP Professional with a 2-GHz CPU. We used the XML data generator 
xmlgen provided by XMark [4] to generate synthetic XML documents from the 
DTD “auction.dtd”, with sizes ranging from 23.4 to 232.2 MB. This DTD 
has a fairly complex structure, which helps make the experiments objective. The SAX 
parser available from the Xerces project [18] was used to parse XML data for 
indexing. 

6.2 Experimental Results 

We used queries that featured the various complexities of structural joins. The 
queries contained both ‘/’ and ‘//’ axes and included short, medium, and long 
location paths. 

Q1: //closed_auction/item 

Q2: //items/name 

Q3: //open_auction//description 

Q4: //open_auction//description//listitem 

Q5: //open_auction//description//keyword 

Q6: //closed_auctions/closed_auction/annotation/description/ 
parlist/listitem/text/emph/keyword/ 

The experimental results are shown in Figures 4 and 5, where our method, 
Stack-Tree-Desc, and PathStack are abbreviated by SCX, STD, and PS, respectively. 

Fig. 4. Elapsed times for processing short queries Q1-Q3. 



Analysis. The elapsed times for processing simple queries with different data 
sets are shown in Figures 4(a)-(c). Queries Q1 and Q2 both involved a single 
parent-child join whereas the query Q3 had an ancestor-descendant join. Although 
they contained only one structural join, the queries allowed us to compare 
the methods with candidate sets of different cardinalities. VirtualJoin 
is slightly better than Stack-Tree-Desc in the smallest data set (Figure 4(a)) and 
significantly better in the bigger data sets (Figures 4(b) and (c)). 

The medium-complexity queries Q4 and Q5, borrowed from [11], contained several 
ancestor-descendant joins. The elapsed times for processing the queries with 
different data sets are shown in Figures 5(d)-(e). As expected, the ‘//’ axis 
could be processed efficiently using VirtualJoin. 






[Panels: (d) elapsed time for Q4, (e) elapsed time for Q5, (f) elapsed time for Q6] 

Fig. 5. Elapsed times for processing medium and complex queries Q4-Q6. 



A number of queries introduced in XMark [4] have complex structures. One 
example is Q6, which means: “Print the keywords in emphasis in annotations of 
closed auctions”. The elapsed time for processing the query with different data 
sets is shown in Figure 5(f). For the complex query, the advantage of VirtualJoin 
over STD and PS is very significant. This is because, for such queries, 
VirtualJoin avoids loading a larger amount of index data from 
secondary memory. For the location path of Q6, only the index data of 
the element keyword was loaded and the remaining part of the evaluation 
process was done in main memory. 

In our query set, the join workloads increased from query Q1 to Q6. In 
all experiments, we could see an interesting tendency that the advantage of 
VirtualJoin steadily increased in proportion to the size of the experimental 
data sets as well as the join workload. 

7 Related Work 

The structural summary has been presented as a graph for semistructured data 
in [7], [14], [15]. A path-based approach to query the XML data by storing all 
available node paths in a table of RDBMS and making queries over the pattern 
of the node paths has been proposed [19]. An integration of XML node numbers 
in query statements and an algorithm for transformation from XPath to SQL 
have been discussed in [5]. 

The representation of XML elements by a tuple (docID, startPos, endPos, 
nodeLevel) and its variants for processing structural joins have been used in [3], 
[9], [10], [12], [13]. Indexing structures built into RDBMSs, such as the B+-tree 
and R-tree, have been exploited in [11] to index the representation values. 

Some current work is related to our approach [1], [2]. XML documents are 
embedded in a binary tree in [1], the depth of which is high in practice. XR-tree 
proposed in [2] manages the stab lists used to find pairs of the qualified elements 
in ancestor-descendant structural joins. The index implementation requires a 
new data structure other than the widely used B+-tree and does not support 
the parent-child relationship well. 

Another related work is the UID method [8], which assigns consecutive integers 
starting from 1 to the nodes of an XML tree in order from top to bottom and 
from left to right in each level, assuming that each internal node has the 
same fan-out, equal to the maximum fan-out of the nodes. In contrast, the design of SPIDER 
is compact and robust under structural updates, an issue that limits the application 
of the UID method in query processing. 

Although independently developed, our approach has some similarity with a 
recent work [22] in using two indices to perform XML query processing. 



8 Conclusion 

Existing techniques for processing XML queries do not efficiently utilize the doc- 
ument structure described in the DTD or XML schema. In this study, we pro- 
posed VirtualJoin, a mechanism that enables structural joins based on SPIDER, 
a numbering scheme that incorporates both structure and tag name information 
extracted from DTD or XML schema. VirtualJoin can avoid the disk I/O for 
the index data of the intermediate nodes in the structure of an XML query. Ac- 
cording to our experiments, the disk I/O improvement of VirtualJoin increases 
in proportion to the structural join workload and the sizes of data sets. 

In the future, we plan to investigate the application of VirtualJoin to querying 
XML data stored in an RDBMS. 

Abstract. In this paper, we describe an approach to boosting the performance 
of an XQuery engine by identifying and exploiting opportunities to share proc- 
essing both within and across XML queries. We first explain where sharing op- 
portunities arise in the world of XML query processing. We then describe an 
approach to shared XQuery processing based on memoization, providing de- 
tails of an implementation that we built by extending the streaming XQuery 
processor that BEA Systems incorporates as part of their BEA WebLogic Inte- 
gration 8.1 product. To explore the potential performance gains offered by our 
approach, we present results from an experimental study of its performance 
over a collection of use-case-inspired synthetic query workloads. The perform- 
ance results show that significant overall gains are indeed available. 



1 Introduction 

XQuery [18], while not yet a standard, is already being put to use in commercial 
software infrastructure products for a number of different IT purposes. For example, 
the XQuery language (and its sub-language XPath) has been incorporated into several 
products for business process management and application integration. XQuery is 
used in several ways there - as a transformation language for defining XML data 
transformations, as an expression language for making branching and looping deci- 
sions based on XML workflow variables, and as a filtering and routing language for 
handling message broker events. XQuery is also being used in enterprise information 
integration products that provide virtual XML views of disparate enterprise data 
sources where it is the language for defining integrated views and writing queries. 

As XQuery adoption gains momentum, the performance of XQuery processing 
becomes increasingly important. As with any query language, XQuery is amenable to 
a large number of optimizations, both at compile time and at runtime. In many of the 
uses to which XQuery is being put, significant optimization opportunities can be 
obtained through the discovery and exploitation of shared processing, within or 
across queries. For example, in publish/subscribe, query evaluation work can be 
shared when matching messages against a large number of subscriptions [5]. In this 
paper, we investigate the exploitation of such sharing opportunities to boost the performance 
of XQuery processing. In particular, we develop memoization techniques 
for XQuery and apply them in the context of a commercial streaming XQuery proces- 
sor. 

Sharing in XQuery processing. Intuitively, intermediate results of XQuery proc- 
essing can be shared whenever the “same” XQuery expression(s) would otherwise be 
evaluated more than once with the “same” XQuery variable bindings. (We will say 
more about what “same” means in this context in Section 2.) This can happen in sev- 
eral ways: 

1. The same expression can occur several times in different locations within a 
query. 

2. The same expression can occur in different queries that are evaluated together. 

3. An expression can occur within a query that is evaluated multiple times (most 
likely with different variable bindings). 

4. An expression can occur in different queries that are executed at different times 
(where the query context is the same across executions). 

The first case is self-explanatory. The second case arises in contexts like pub- 
lish/subscribe, where an incoming XML message needs to be checked against many 
subscription queries. An example of the third and fourth cases is a web service call or 
a remote database lookup modeled as an XQuery function call, where the results of 
the call are known to be stable over time (at least for some specified time period). 

In this work, we propose a memoization-based approach to avoiding redundant 
work. Memoization caches the results for an expression based on its variable bind- 
ings, and it can thus support evaluation reuse in all of the above cases. 
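
The basic idea can be sketched in a few lines. This is a simplified illustration rather than the implementation in the BEA engine: the class MemoTable, the string used to identify an expression, and the lookup_price function standing in for a remote call are all hypothetical, and binding values are assumed to be hashable.

```python
class MemoTable:
    """Cache results keyed by (expression, variable bindings)."""

    def __init__(self):
        self.cache = {}
        self.hits = 0

    def evaluate(self, expr_key, bindings, compute):
        key = (expr_key, tuple(sorted(bindings.items())))
        if key in self.cache:
            self.hits += 1              # reuse the earlier evaluation
            return self.cache[key]
        result = compute(**bindings)    # evaluate once, then remember
        self.cache[key] = result
        return result

calls = []

def lookup_price(item):
    """Stand-in for an expensive call whose result is stable over time."""
    calls.append(item)
    return {"pen": 2, "ink": 5}[item]

memo = MemoTable()
prices = [memo.evaluate("lookup_price($item)", {"item": i}, lookup_price)
          for i in ["pen", "ink", "pen"]]
print(prices, len(calls), memo.hits)  # [2, 5, 2] 2 1
```

The same table serves all four cases listed above, since reuse depends only on the expression and its bindings, not on where or when the expression is evaluated.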

Streaming XQuery processing. Our approach is designed to work well in the 
context of an XML query processor that employs stream-based processing. In the 
context of XML query processing, streaming is important for performance, and it can 
occur at a fine level of granularity. A fine-grained approach is critical given that a 
single XML item can be arbitrarily large, containing the equivalent of an entire ta- 
ble’s or even database’s worth of data content. To enable fine-grained streaming, the 
BEA XQuery engine [6], the engine on which this work is based, represents its XML 
operands as sequences of (potentially nested) tokens that represent smaller constituent 
data pieces. 

The use of a token stream representation of XML provides an XQuery processor 
with several ways to achieve incremental query evaluation while avoiding the materi- 
alization of its inputs. The first way is pipelining. A given XQuery expression can 
consume and produce token streams incrementally, materializing only one or a few 
tokens at a time in order to compute and emit its output. Of course, this requires the 
use of a pull-based API to be truly effective. The second way is lazy evaluation, a 
technique commonly used in the implementation of functional programming lan- 
guages [11]. With this technique, a result is not actually generated until requested by 
a consuming expression. Moreover, in XQuery, some expressions can be evaluated 
based on only the first few tokens of a given input - for example, nth( ), empty( ), 
exists( ), existential comparators, and positional predicates. These expressions enable 
an even lazier mode for XQuery processing, where only those (possibly few) tokens 
needed for the consuming expression are generated. 
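
The effect of pull-based, lazy consumption can be illustrated with Python generators. This is only an analogy under stated assumptions: the token layout and the functions token_stream, exists, and nth are hypothetical and do not reflect the BEA token stream API.

```python
from itertools import islice

produced = []  # records every token that is actually generated

def token_stream(n):
    """Lazily produce n tokens, one per pull."""
    for i in range(n):
        produced.append(i)
        yield ("TEXT", i)

def exists(tokens):
    """True if the stream has at least one token; pulls at most one token."""
    return next(iter(tokens), None) is not None

def nth(tokens, k):
    """Return the k-th token (1-based), pulling only the first k tokens."""
    return next(islice(tokens, k - 1, k), None)

print(exists(token_stream(1_000_000)), len(produced))  # True 1
```

Although the input stream could deliver a million tokens, only one is generated before exists( ) returns its answer.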

Contributions. In this paper, we present a memoization-based approach to sharing 
in XQuery processing. While both the multiple-query processing (MQP) problem 
[17] and the use of memoization for query processing [10] have been explored in 
other contexts, our contributions lie in the fact that shared XQuery query processing 
in a streaming environment adds significant new wrinkles to the problem. In particular, 
MQP in the relational setting has focused on SELECT-FROM-WHERE style 
constructs, whereas our work is aimed at supporting sharing for the much richer 
XQuery language. Memoization has been exploited for expensive functions (as in 
query processing) or repeatedly computed functions (as in dynamic programming), 
but it has not been studied for a large variety of XQuery expressions and in a stream- 
based processing environment. The main contributions of this work can be summa- 
rized as follows: 

1. We set the scope for XQuery memoization, first in a simple but limited way, 
and then in an expanded range exploiting semantic data and expression 
equivalence. 

2. We develop a number of query compilation techniques to identify interesting 
shareable XQuery expressions and to determine the granularity of memoiza- 
tion. 

3. We also extend the runtime system, resolving the inherent tension between 
stream-based processing and memoization. Our solutions enable computation 
reuse while supporting pipelining and avoiding eager evaluation. 

4. We summarize results from a performance study of our techniques in the con- 
text of the BEA XQuery engine. The results show significant performance 
gains for typical use cases of XQuery. 

5. As this paper represents our initial approach towards adding memoization to 
XQuery processing, we identify several important open problems to be ad- 
dressed. 

The paper is organized as follows. Section 2 discusses basic issues related to XQuery memoization. 
Section 3 describes the BEA XQuery engine, the technical context of this 
work. Sections 4 and 5 describe how we have added memoization to this engine, 
focusing on the compile-time and runtime aspects, respectively. Section 6 reports 
experimental results. Section 7 covers related work. Section 8 concludes the paper. 



2 Basics of XQuery Memoization 

In this section we address the basic issues related to XQuery memoization. Memoiza- 
tion is an algorithmic technique that remembers the results returned by functions 
invoked with particular arguments and, if the function is called with the same argu- 
ments again, returns the result from memory rather than recalculating it [11, 13]. In 
the context of XQuery, the unit of computation that we adopt for memoization is the 
(XQuery) expression. 

In its simplest form, XQuery memoization can be implemented in a straightforward 
way: results of an expression are shared whenever an identical XQuery expression 
is evaluated more than once with identical XQuery variable bindings, where the 
meaning of “identical” is based on bit-wise comparison of their binary representations.¹ 
The usefulness of such memoization, however, is limited by the stringent requirement 
of identical binary representations. To expand the scope for XQuery 
memoization, we would like to establish equivalence relationships between expres- 
sions and between variable bindings, so that ample reuse of the computed results is 
possible. To this end, we relax the conditions for the application of memoization 
along two dimensions: (1) when two XQuery data model instances can be determined 
to be equivalent; and (2) when two XQuery expressions can be determined to be 
equivalent. 

XML Data Equivalence. What we seek is an equivalence relationship on XQuery 
data model instances that meets the following requirement for safe memoization: 
Given an expression E, for every pair of equivalent XQuery data model instances, the 
two results of evaluating E on the two instances are also equivalent. XQuery has 
multiple equality testing predicates (=, eq, is, deep-equal( ), ...) to compare data 
model instances. Unfortunately, none of these is satisfactory for establishing data 
equivalence for safe memoization (the analysis is omitted here in the interest of 
space). As a result, we define our own, more comprehensive (but still imperfect) 
equivalence relationship between XQuery data model instances: 

Definition 1 (XML data equivalence). Two data model instances are equivalent iff 
one of the following conditions is true: (a) they both represent the empty sequence, or 
(b) they are both single atomic values, their primitive values are equal (based on the 
eq comparison on their respective primitive XML data types), and their type annotations 
are also equal (based on the eq comparison on their xs:QName data types), or (c) 
they are both nodes and they compare true via the is comparison, or (d) they are both 
sequences of the same length l > 1 and the corresponding items in the ith position 
(1 <= i <= l) are equivalent via the conditions (b) or (c). 

Unfortunately, memoization based on this definition is not safe for every possible 
XQuery expression. For example, consider the memoization of the string( ) function 
for two dateTime instances that use different time zones but have the same normalized 
values (i.e., the same UTC time). Based on Definition 1, the two dateTime instances 
are equivalent (via condition (b)); however, the results of applying string( ) to 
these instances are not equivalent, due to the different time zones included in the 
output strings. As such, memoization in this case would cause erroneous results. 
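
The same pitfall can be reproduced outside XQuery. The sketch below uses Python's datetime type purely as an analogy (it is an assumption that this mirrors the eq comparison closely enough for illustration): two values that denote the same instant compare equal, so a cache keyed on the value returns a string with the wrong time zone.

```python
from datetime import datetime, timedelta, timezone

utc_time = datetime(2004, 8, 30, 12, 0, tzinfo=timezone.utc)
tokyo_time = datetime(2004, 8, 30, 21, 0, tzinfo=timezone(timedelta(hours=9)))

# The two values are the same instant, so they compare (and hash) as equal.
print(utc_time == tokyo_time)  # True

# A cache keyed on the value, filled from the UTC instance.
cache = {utc_time: utc_time.isoformat()}

# Looking up the Tokyo instance hits the cached UTC string.
print(cache[tokyo_time])       # 2004-08-30T12:00:00+00:00
print(tokyo_time.isoformat())  # 2004-08-30T21:00:00+09:00
```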

Given our goal of exploiting semantic data equivalence for memoization and the 
fact that doing so correctly for the full XQuery language is a very hard problem, our 
current solution is restricted to a subset of XQuery for which Definition 1 is guaranteed 
to provide safe memoization. Roughly speaking, every expression in this subset 
is such that each free variable of the expression satisfies one of the following conditions: 
(1) the type of the free variable is a node; (2) the type of the free variable is an atomic 
type and the computation performed by the expression is compatible with the eq 
comparison defined on this type; or (3) the type of the free variable is a sequence of 
items and the items in the sequence have the same type that satisfies condition (1) or 
(2). As our experimental results show, even this limited definition of equivalence can 
provide significant performance improvements for XQuery processing in use cases 
similar to those that we would expect to see in web services, application integration, 
etc. 

¹ Of course, expressions that produce non-deterministic results are not suitable for memoization. 
Examples of such expressions include functions that read the system time (e.g., 
fn:current-dateTime( )), and user-defined functions that are declared to be variant. 

Implementing Memoization in a Streaming XQuery Processor 

Expression equivalence. In general, two XQuery expressions are equivalent if and 
only if they return the same result for every correct binding of their variables. Deciding 
this is undecidable in general, since XQuery is Turing-complete. As a result, we 
identify sufficient conditions for XQuery expression equivalence based on expression 
normalization and detection of syntactical equivalence between normalized expres- 
sions (which will be described in detail in Section 4). As our experimental results 
show, these conditions permit ample reuse of computations (given typical use cases) 
while also being efficiently computable. 

Thus, our approach represents a practical compromise between an overly restric- 
tive definition of XQuery memoization and the difficulties that arise due to the fully- 
general nature of XQuery. While we believe that our approach is applicable in a large 
number of practical situations, expanding its range is part of our ongoing work. 



3 The BEA Streaming XQuery Processor 

In this section we review the aspects of the BEA XQuery engine [6] that are directly 
relevant to our subsequent design descriptions. The representation of XML data used 
internally by the BEA XQuery runtime is a sequence of tokens called the token 
stream. Despite its similarity to the SAX API, the token stream models typed XML, 
and is accessed via a pull-based API for producing and consuming tokens lazily. 
Moreover, the BEGIN tokens for documents, elements or attributes are augmented to 
carry the ids of the nodes in order to compare nodes for both equality and document 
ordering. Further details on the token stream can be found in [6]. 

3.1 Query Compilation 

The purpose of the XQuery compiler is to parse, verify, type check, normalize and 
optimize a query. The result of compilation is an iterator tree that can be interpreted 
by the runtime system. 

During compilation, a query is represented as an expression tree. Nodes in an ex- 
pression tree represent kinds of expressions and edges represent data flow dependen- 
cies. The kinds of expressions used by the BEA XQuery processor are very close to 
the W3C XQuery formal semantics recommendation [20], and include Constants, 
Variables, FirstOrder expressions, SecondOrder expressions, IfThenElse, Node Constructors, 
etc. All built-in functions and operators of the XQuery standard [19] share 
the same representation - each is a FirstOrder expression. Examples include all 
XPath axes (e.g., child, descendant, parent), the data( ) function, arithmetic functions, 
comparisons, and constructor functions for simple types (e.g., float( ), string( )). The 
Match expression carries out an XPath nodetest (i.e., kind of node and name of node). 
A Node Constructor creates a new XML structure. In this particular processor, node 
id generation is decoupled from Node Constructors and postponed until later when 
needed for query evaluation.² The family of SecondOrder expressions can be further 
classified into Map, Let, Sort, etc., most of which represent the higher-order functions 
of XQuery. Map will be described more closely in the example below. 



[Figure: expression tree of query Q1. Recoverable node labels: Map (defining $line), Match:OrderLine, FO:children, Match:Order, FO:children, $doc.] 

Fig. 3-1. Expression tree of query Q1. 



The translation of an XQuery into an expression tree closely follows the W3C 
XQuery formal semantics [20]. For example, the for clause of a FLWOR query is 
translated into nested Maps, each of which defines one variable; the where clause into 
an IfThenElse expression; and so on. Consider query Q1 below, which requests line 
items in a purchase order document that have a particular seller. 

Query Q1: for $line in $doc/Order/OrderLine 
          where xs:integer(data($line/SellersID)) eq 1 
          return <LineItem> { $line/Item/ID } </LineItem> 

Fig. 3-1 shows the expression tree for this query, where constant expressions are 
shown as rounded rectangles, variables as ovals, and all other expressions as rectangles. 
Expressions other than constants and variables are labeled with the kind of expression 
(“FO” for FirstOrder), followed by a colon, followed by any optional specifications 
of an expression, such as a particular FirstOrder function (e.g., children( ) or 
data( )) as well as any other parameters (e.g., the NodeTest of a Match expression). 
Map is the only SecondOrder expression in this example. Note that the left child of 
the Map expression is labeled with the name of the variable defined in the Map. Uses 
of the variable in the right child of the Map are denoted by shaded ovals. 

¹ This decoupling raises the potential for sharing the node construction computation. 

A free variable of an expression is a variable that is not defined by any SecondOrder 
expression inside this expression, essentially representing an input of the expression. 
For example, “$doc” is a free variable of the Map, but “$line” is not. However, 
“$line” is indeed a free variable of IfThenElse, the right child expression of Map, and 
of the expressions inside IfThenElse that contain this variable. We call a mapping 
between the free variables of an expression and a set of values a binding. 
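
The notion of free variables can be illustrated with a minimal sketch. The Expr class, the step( ) helper, and the two-child layout of Map (input expression, then body) are simplifying assumptions made for this illustration and are not the data structures of the BEA processor.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Expr:
    kind: str                      # "Const", "Var", "FO", "Match", "Map", ...
    label: str = ""                # function name, node test, or variable name
    children: List["Expr"] = field(default_factory=list)


def free_vars(e: Expr) -> Set[str]:
    """Variables used in e that are not defined by a Map inside e."""
    if e.kind == "Var":
        return {e.label}
    if e.kind == "Map":
        source, body = e.children
        # The Map's own variable (its label) is visible only in the body.
        return free_vars(source) | (free_vars(body) - {e.label})
    found: Set[str] = set()
    for child in e.children:
        found |= free_vars(child)
    return found


def step(name: str, context: Expr) -> Expr:
    """One child step: Match:name over FO:children over the context."""
    return Expr("Match", name, [Expr("FO", "children", [context])])


# A simplified expression tree for query Q1.
doc, line = Expr("Var", "$doc"), Expr("Var", "$line")
condition = Expr("FO", "eq", [
    Expr("Cast", "integer", [Expr("FO", "data", [step("SellersID", line)])]),
    Expr("Const", "1"),
])
result = Expr("NodeConstr", "LineItem", [step("ID", step("Item", line))])
if_then_else = Expr("IfThenElse", "", [condition, result, Expr("Const", "()")])
q1 = Expr("Map", "$line", [step("OrderLine", step("Order", doc)), if_then_else])

print(free_vars(q1))            # {'$doc'}
print(free_vars(if_then_else))  # {'$line'}
```

In this sketch a binding would simply be a dictionary from the names returned by free_vars( ) to values.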

At the last step of query compilation, the compiler generates code for the query 
expression, resulting in an executable iterator tree. There is a one-to-one mapping 
between many of the nodes in the expression tree and iterators in the generated iterator 
tree; however, a few iterators implement several expressions (e.g., the children( ) 
iterator implements a child( ) expression plus a Match expression) for performance 
reasons. Variable expressions are implemented by a special runtime variable iterator 
that returns the value of a variable and can be bound to different inputs at runtime. 

3.2 Runtime System 

The task of the runtime system is to interpret an iterator tree to produce the query 
result. Like many database query engines, the BEA XQuery runtime system is based 
on an iterator model [9]. Its query execution model is pull-based, and data is consumed 
at the granularity of tokens. Using the iterator model, the runtime system naturally 
exploits pipelining. It also makes use of lazy evaluation; that is, an iterator only 
generates results on demand, with each next( ) call. Consider the iterator that implements 
the empty( ) function. This iterator consumes only a single token from its input 
in order to produce a Boolean result. The remaining input tokens are not consumed, 
and thus are not even generated. Other expressions where lazy evaluation is effective 
include positional predicates (e.g., $line/Item[1]) and existential quantification. 
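
The effect of pull-based, lazy evaluation can be sketched with Python generators. This is only an analogy to the iterator model described above: token_source( ) is a hypothetical input iterator, and the real runtime operates on XML tokens rather than strings.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def token_source(n: int, generated: List[int]) -> Iterator[str]:
    """Hypothetical input iterator; records each token it actually produces."""
    for i in range(n):
        generated.append(i)
        yield f"token-{i}"


def empty(tokens: Iterable[str]) -> bool:
    """empty( ): pull at most one token from the input, then stop."""
    for _ in tokens:
        return False
    return True


def first_items(tokens: Iterable[str], k: int) -> List[str]:
    """Positional predicate such as [1]: pull only the first k tokens."""
    return list(islice(tokens, k))


generated: List[int] = []
is_empty = empty(token_source(1000, generated))
print(is_empty, generated)  # False [0]
```

Although the input could produce 1000 tokens, only the first one is ever generated, which is the behavior described for the empty( ) iterator.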



4 Query Compilation for Memoization 

In this section we describe the compile-time aspects of arranging for efficient evalua- 
tion of a set of XQuery queries (referred to as the “target queries”). We restrict our 
attention to the cases where the target queries share the static context and most of the 
dynamic context (except the date and time of execution).² The techniques presented in 
this section focus on sharing among common subexpressions. Although not discussed 
below, sharing by caching expensive methods [10] can be easily supported by indi- 
cating such expressions to the compiler. 

4.1 Expression Equivalence 

Our approach to expression equivalence is based on two steps. First, all expressions 
are normalized by applying a set of rewriting rules. The rewriting rules include ones 
that “normalize” queries based on the XQuery formal semantics [20], and others that 
are typically applied in XQuery optimization, e.g., unnesting nested FLWOR expres- 
sions whenever possible, putting predicates in conjunctive normal form, etc. More 
details of these rewriting rules are provided in [6]. 



² XQuery memoization in the presence of different static and/or dynamic contexts poses 
difficulties that are beyond the scope of this paper; we leave that generalization for future work. 
The second step searches for syntactical equivalence between expressions. Our approach 
to determining equivalence is based on the notion of variable renaming substitutions. 
We say that two expressions E1 and E2 are syntactically equivalent via a 
renaming substitution S = {x_1/y_1, ..., x_n/y_n} if {x_1, ..., x_n} are the free variables of E1, 
{y_1, ..., y_n} are the free variables of E2, and E1 and E2 are syntactically isomorphic 
up to a renaming of (free and bound) variables. For example, the expressions 
($x+1)*($x-3) and ($y+1)*($y-3) are syntactically equivalent via the renaming 
substitution {$x/$y}. 

Given this definition, we develop an algorithm that detects syntactical equivalence 
between two expressions E1 and E2. If E1 and E2 are syntactically equivalent via the 
renaming substitution S = {x_1/y_1, ..., x_n/y_n}, the algorithm returns S; otherwise it 
returns null. In the sequel, we call this algorithm “equals( )”. 

Given two input expressions E1 and E2, equals( ) iterates on E1 and E2 and their 
subexpressions from top to bottom, checking recursively at each level for syntactic 
isomorphism. Obviously two expressions are not (syntactically) equal if they are not 
of the same kind (e.g., constants, variables, Maps, etc.). Moreover, it is clear that the 
details of the recursive algorithm depend on the kinds of the expressions E1 and E2. 
XQuery has more than 15 kinds of expressions. While our algorithm handles all of 
them, for brevity, here we describe only three: 

Constant expressions. If E1 and E2 are both constant expressions, then they are 
equal via a renaming substitution S iff the given constants are equivalent via the data 
equivalence Definition 1 given in Section 2. 

FirstOrder expressions. If E1 and E2 are both FirstOrder expressions, then they 
are equal via the renaming substitution S iff they have the same operator and the same 
number of children subexpressions, and the children subexpressions are pairwise 
equal via the same substitution S. 

Map expressions. Assume that both E1 and E2 are Map expressions of the form: 
E1 = for $var1 in expr1 return expr2 
E2 = for $var2 in expr3 return expr4 

First, the algorithm will test the structural equivalence of expr1 and expr3. If this succeeds 
with renaming substitution S then the algorithm continues; otherwise it fails. In 
the positive case, the renaming {$var1/$var2} is added to the current renaming substitution 
S and the algorithm will continue by testing the structural equivalence of 
expr2 and expr4 via the new S. In case the test succeeds and an augmented substitution 
S is returned, the end result of the test is the substitution S without the renaming 
of the internal variables {$var1/$var2}. Otherwise the test fails. 
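
The three cases above can be sketched as follows. This is an illustrative reconstruction rather than the algorithm as implemented: the classes Const, Var, FirstOrder and Map are hypothetical, constants are compared with Python equality instead of Definition 1, and Map variables are assumed not to shadow outer variables.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Const:
    value: object


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class FirstOrder:
    op: str
    args: Tuple[object, ...]


@dataclass(frozen=True)
class Map:
    var: str          # variable defined by the for clause
    source: object    # expression after "in"
    body: object      # expression after "return"


Subst = Dict[str, str]


def equals(e1, e2, s: Optional[Subst] = None) -> Optional[Subst]:
    """Return a renaming substitution from e1's variables to e2's, or None."""
    s = dict(s) if s is not None else {}
    if type(e1) is not type(e2):
        return None
    if isinstance(e1, Const):
        return s if e1.value == e2.value else None
    if isinstance(e1, Var):
        if e1.name in s:
            return s if s[e1.name] == e2.name else None
        if e2.name in s.values():
            return None               # keep the renaming one-to-one
        s[e1.name] = e2.name
        return s
    if isinstance(e1, FirstOrder):
        if e1.op != e2.op or len(e1.args) != len(e2.args):
            return None
        for a, b in zip(e1.args, e2.args):
            s = equals(a, b, s)       # children share one substitution
            if s is None:
                return None
        return s
    if isinstance(e1, Map):
        outer = equals(e1.source, e2.source, s)
        if outer is None:
            return None
        inner = equals(e1.body, e2.body, {**outer, e1.var: e2.var})
        if inner is None:
            return None
        inner.pop(e1.var, None)       # drop the renaming of the bound variable
        return inner
    return None


def product(v: str) -> FirstOrder:
    """Builds ($v+1)*($v-3)."""
    return FirstOrder("*", (FirstOrder("+", (Var(v), Const(1))),
                            FirstOrder("-", (Var(v), Const(3)))))


print(equals(product("$x"), product("$y")))  # {'$x': '$y'}
```

Each node is visited once, which is consistent with a test that is linear in the size of the input expressions.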

Note that the complexity of the structural test equals( ) is linear in the size of the 
input expressions. Given that the potentially interesting expressions for sharing 
among the target queries include all the subexpressions of these queries, the structural 
test needs to be applied to all possible pairs in the Cartesian product of the subexpressions 
of the target queries, yielding a very expensive algorithm. Next, we describe the 
technique used in our implementation to avoid this pairwise comparison cost. 
4.2 Applying the Algorithm 

Identifying common subexpressions is implemented as an additional step taken by the 
compiler after query parsing, normalization and optimization, but before code gen- 
eration. In this step, the compiler iterates over the target queries and identifies com- 
mon subexpressions both inside each query and between this query and earlier que- 
ries. 

For each query, the compiler performs a depth first search in the query expression 
tree to identify the maximal shareable subexpressions: For each subexpression en- 
countered that is not a constant or a variable, the compiler performs three tasks: (1) 
Apply hashing on the subexpression, ignoring all the variable names, and use the 
hashing result to probe the in-memory storage of all distinct subexpressions, each of 
which serves as a representative of an equivalence class. (2) If representatives with 
the same hashing result exist, for each of them call equals( ) on the representative and 
the subexpression in hand. (3) If any representative is equivalent to the subexpression, 
apply heuristics to filter out uninteresting cases of common subexpressions (such as 
inexpensive operations, e.g., a simple addition, and expressions that are not very 
expensive but could return large results e.g., a child path expression with wildcards). 
If a representative passes all these tests, the compiler determines that the representa- 
tive matches the subexpression, and stops further traversal into this subexpression. 
Otherwise, it updates the storage of equivalence classes with the unmatched subex- 
pression and continues the search in the children of this subexpression. 
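
The three tasks can be sketched on a simplified representation in which an expression is a tuple (kind, label, child, ...). Everything in this sketch is an assumption made for illustration: the tuple encoding, the reduced equals( ) (no scoping of Map variables), and in particular the heuristic, which simply treats FirstOrder function applications as uninteresting.

```python
def shape(e):
    """Canonical form of e with every variable name replaced by a placeholder."""
    kind, label, *children = e
    if kind == "Var":
        return ("Var", "_")
    return (kind, label) + tuple(shape(c) for c in children)


def equals(e1, e2, s):
    """Reduced equivalence test: same structure under a consistent renaming."""
    k1, l1, *c1 = e1
    k2, l2, *c2 = e2
    if k1 != k2 or len(c1) != len(c2):
        return None
    if k1 == "Var":
        return {**s, l1: l2} if s.get(l1, l2) == l2 else None
    if l1 != l2:
        return None
    for a, b in zip(c1, c2):
        s = equals(a, b, s)
        if s is None:
            return None
    return s


def is_interesting(e):
    """Hypothetical heuristic: plain function applications are too cheap."""
    return e[0] != "FO"


def find_shared(e, classes, shared):
    """Depth-first search for maximal shareable subexpressions."""
    if e[0] in ("Const", "Var"):
        return
    key = hash(shape(e))                          # (1) hash, ignoring names
    known = False
    for rep in classes.get(key, []):
        if equals(rep, e, {}) is not None:        # (2) structural test
            known = True
            if is_interesting(e):                 # (3) heuristics
                shared.append((rep, e))
                return                            # maximal: do not descend
    if not known:
        classes.setdefault(key, []).append(e)     # new representative
    for child in e[2:]:
        find_shared(child, classes, shared)


def step(name, ctx):
    return ("Match", name, ("FO", "children", ctx))


def query(var, field, const):
    v = ("Var", var)
    path = step("OrderLine", step("Order", ("Var", "$doc")))
    cond = ("FO", "eq", ("Cast", "integer", ("FO", "data", step(field, v))),
            ("Const", const))
    ret = ("NodeConstr", "LineItem", step("ID", step("Item", v)))
    return ("Map", "", v, path, ("IfThenElse", "", cond, ret, ("Const", ())))


classes, shared = {}, []
find_shared(query("$line", "SellersID", 1), classes, shared)   # Q1
find_shared(query("$item", "BuyersID", 8), classes, shared)    # Q2
print([(e[0], e[1]) for _, e in shared])
# [('Match', 'OrderLine'), ('NodeConstr', 'LineItem')]
```

Because only expressions that fall into the same hash bucket are compared, equals( ) is called on a small number of candidate pairs rather than on the full Cartesian product.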

As an example, consider query Q1 from Section 3 together with Q2 given below. 

Query Q2: for $item in $doc/Order/OrderLine 
          where xs:integer(data($item/BuyersID)) eq 8 
          return <LineItem> { $item/Item/ID } </LineItem> 

After query Q2 is parsed, normalized, and optimized, it is represented by an ex- 
pression tree similar to that in Fig. 3-1 except for the if expression (i.e., the leftmost 
branch of IfThenElse). Table 1 shows the results of the compilation actions applied to 
the expression tree of Q2 for identifying maximal shareable subexpressions after all 
subexpressions of Q1 have been processed. The rows contain the subexpressions 
considered in order of the depth first search. As Table 1 shows, Map is filtered by the 
first step of hashing because it contains a different path expression and a different 
constant in its if expression. Match:OrderLine and NodeConstr are the two maximal 
common expressions identified. Note that although FO:children and FO:( ) are 
equivalent via equals( ), they are filtered by our heuristics as being uninteresting 
sharing cases. 

The implementation of code generation is also modified to take into account the 
identified common subexpressions. The compiler again iterates over the query set in 
the same order. For each query, code generation proceeds recursively in the expression 
tree as before, except for common subexpressions. For the instances of a common 
subexpression, a CacheIterator (which will be described in detail in the next 
section) is created for each instance, but all such CacheIterators point to the same 
memo table, which is where the results of memoization are cached at runtime. 
Table 1. Identifying Common Subexpressions between Q1 and Q2 

Expression         (1) hashing   (2) equals   (3) heuristics 
Map                No            -            - 
Match:OrderLine    Yes           Yes          Yes 
IfThenElse         No            -            - 
FO:eq              No            -            - 
Cast:integer       No            -            - 
FO:data            No            -            - 
Match:SellersID    No            -            - 
FO:children        Yes           Yes          No 
NodeConstr         Yes           Yes          Yes 
FO:( )             Yes           Yes          No 



Fig. 4-1. Iterator trees of two queries containing common subexpressions 



Fig. 4-1 shows the iterator trees generated for Q1 and Q2. The two queries have 
separate iterator trees. The structure of each iterator tree is similar to the expression 
tree (see Fig. 3-1), with abstract expressions replaced by specific iterators and 
some optimizations of the structure (e.g., merging Match and children). Moreover, 
instances of a common subexpression identified previously (as shown in Table 1) are 
wrapped by different CacheIterators that point to the same memo table. 



5 Query Execution for Memoization 

Having presented the compile-time aspects of our solution, we now describe the 
extensions required for the runtime system. These extensions are encapsulated into a 
new iterator called CacheIterator. The key data structure used by the CacheIterator is 
the memo table, which maps from the set of XQuery data model instances bound to 
the free variables of an expression to the computed/cached result of that expression. 
The main challenges in the memo table implementation, addressed here, stem from 
the inherent tension between memoization and streaming XQuery processing. 
The first challenge is to obtain the “identity” of the bindings of the free variables, 
so that this identity can then be used as a key to probe the memo table. If the binding 
of a free variable is a sequence of items, the entire sequence needs to be read before 
the identity can be computed and any result produced, which unfortunately breaks 
pipelining. In addition, with lazy evaluation, the binding of a free variable which is 
provided by a subexpression may not yet be evaluated at the point when this key is 
needed. How to perform memo table lookup in the face of these conflicts is crucial 
for computation reuse without losing the performance benefits of stream-based proc- 
essing. Our solution to this is presented in Section 5.1. 

The second challenge relates to the output. Recall that a very lazy mode of execu- 
tion is used for XQuery processing. That is, the result of an expression is produced 
token by token, upon request, rather than in its entirety. Result caching, however, 
works best if the whole result is pre-computed because it is unknown how many to- 
kens of the result are to be consumed by the different consumers. We refer to the kind 
of caching that pre-computes the entire result as Complete Caching. In contrast, Partial 
Caching does not pre-compute entire results; it caches those parts of the results 
that have been requested by the consumers. Partial Caching is favorable from a per- 
formance perspective, but it is more difficult to integrate into a stream-based XQuery 
processor. These two caching schemes are described in more detail in Section 5.2. 

5.1 Memo Table Lookup 

The purpose of a memo table is to map the values of the free variables of a common 
subexpression to a (completely or partially) cached result. To this end, the memo 
table is implemented as a hierarchy of hash tables. Each level in this hierarchy corre- 
sponds to one free variable and is probed using the value of that free variable. Prob- 
ing a memo table with a set of values bound to the free variables results in either a 
reference to the cached result (i.e., a memo table hit) or a null pointer (i.e., a memo 
table miss). 
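
A memo table organized as a hierarchy of hash tables can be sketched as nested dictionaries, one level per free variable. The MemoTable class and the ("node", id) and ("atomic", value) key encoding are assumptions made for this illustration; the sketch also assumes at least one free variable.

```python
class MemoTable:
    """Nested hash tables: level i is probed with the value of free variable i."""

    def __init__(self, free_vars):
        self.free_vars = list(free_vars)   # fixed probing order
        self.root = {}

    def lookup(self, binding):
        level = self.root
        for var in self.free_vars:
            level = level.get(binding[var])
            if level is None:
                return None                # memo table miss
        return level                       # memo table hit: the cached result

    def insert(self, binding, result):
        level = self.root
        for var in self.free_vars[:-1]:
            level = level.setdefault(binding[var], {})
        level[binding[self.free_vars[-1]]] = result


memo = MemoTable(["$doc", "$line"])
binding = {"$doc": ("node", 1), "$line": ("node", 7)}
miss = memo.lookup(binding)
memo.insert(binding, ["<LineItem>", "ID-1", "</LineItem>"])
hit = memo.lookup(binding)
print(miss, hit)  # None ['<LineItem>', 'ID-1', '</LineItem>']
```

A probe costs one hash lookup per free variable, and a miss can be detected at the first level whose key is absent.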

In order to probe the memo table and to record new entries in the memo table in 
the event of lookup misses, it is crucial to know the values of the free variables. As 
stated at the beginning of this section, a naive implementation of memo table lookup 
could break lazy evaluation and pipelined processing of these values, thus adversely 
affecting the performance. At this point, we place an important restriction on the 
notion of data equivalence used for memo table lookup: we disregard condition (d) in 
Definition 1 (given in Section 2) for establishing equivalence between sequences of 
items. In other words, we only cache results of an expression if the type of each free 
variable of this expression is either an atomic type (e.g., integer or date) or a node 
(e.g., element or document). We do not cache results if the type of a free variable is a 
sequence of items. Our implementation of this restriction is based on type checking 
on free variables at compile time. 

The issue with lazy evaluation still exists even with this restriction. The value of a 
free variable can contain an arbitrarily large number of tokens (e.g., for a document), 
which might be produced by another complex expression that we wish to evaluate 
lazily. Fortunately, this restriction does enable us to probe the memo table by only 
looking at the first token of the value of each free variable. If the type of the free 
variable is a node, the first token will contain the id of the node and we can use this id 
to probe the memo table. If, instead, the type of the free variable is an atomic type, we can 
extract the whole value from the first token and use that to probe the memo table. 

5.2 Result Caching 

Implementing complete result caching is easy, once the memo table lookup issue has 
been solved. The queries that contain a common subexpression each instantiate a 
CacheIterator for this common subexpression. The CacheIterators of those queries, 
however, share the same memo table. This memo table is used in the following way: 
For each binding of the free variables, the memo table is probed as described before. 
If the result has been cached, the CacheIterator returns it token by token (whenever its 
next( ) method is called) from the cached result. If the memo table lookup fails, the 
common subexpression is fully evaluated using the current binding, the entire result is 
cached, and the memo table is updated with the (binding, result) pair. The next time 
when the common subexpression needs to be evaluated with the same binding of the 
free variables (within the same query or for another query), the stored result is reused. 
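
Complete Caching as described above can be sketched in a few lines. The class below is a hypothetical stand-in for the CacheIterator: the memo table is reduced to a dictionary keyed by a single binding key, and evaluate( ) is an assumed callable that stands for evaluating the common subexpression.

```python
class CacheIterator:
    """Complete Caching: on a miss, evaluate fully and store the whole result."""

    def __init__(self, memo, evaluate):
        self.memo = memo            # shared dict: binding key -> list of tokens
        self.evaluate = evaluate    # callable(binding key) -> iterator of tokens
        self._pending = iter(())

    def open(self, key):
        if key not in self.memo:                       # memo table miss
            self.memo[key] = list(self.evaluate(key))  # materialize everything
        self._pending = iter(self.memo[key])

    def next(self):
        """Return the next cached token, or None when the result is exhausted."""
        return next(self._pending, None)


calls = []


def evaluate(key):
    calls.append(key)               # count evaluations of the subexpression
    return iter(["<LineItem>", "ID-1", "</LineItem>"])


memo = {}                                  # one memo table ...
q1_iter = CacheIterator(memo, evaluate)    # ... shared by the iterators
q2_iter = CacheIterator(memo, evaluate)    # of two different queries

q1_iter.open(("node", 7))
first = q1_iter.next()
q2_iter.open(("node", 7))
tokens = [q2_iter.next() for _ in range(4)]
print(len(calls), tokens)
# 1 ['<LineItem>', 'ID-1', '</LineItem>', None]
```

The second iterator finds the result in the shared table, so the subexpression is evaluated only once for this binding.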

Complete Caching is simple, but it computes the entire result of a common subex- 
pression which may never be needed. For performance reasons, we would like to 
compute the results of a common subexpression just as lazily as in other situations; 
that is, results stored in the memo table are generated only when they are needed by 
the consumers. This gives rise to the idea of Partial Caching that is able to cache par- 
tial results across queries and compute additional parts of the results later, if needed. 

To implement Partial Caching, we need to keep the iterator tree of the common su- 
bexpression in addition to the partial results. We will use this iterator tree when addi- 
tional results are needed which have not been produced yet. Furthermore, we must 
preserve the state of all iterators in the iterator tree, in particular, the bindings of the 
iterators that represent the free variables in the iterator tree. In general, preserving 
such states across queries can be costly and may involve further materialization. 

We currently focus on a special case, for which preserving the state is relatively 
simple; that is, the common subexpression is resumable. An expression is resumable 
if its free variables are bound only once in one invocation of the query execution. 
This condition can be checked at compile time based on the static types of the expres- 
sions that compute the values of the free variables. Common examples of such ex- 
pressions occur in web services where path expressions are prefixed with an external 
variable that will be bound to each incoming message. For a resumable expression, 
we simply store the iterator tree together with the partial results in the memo table and 
use the iterator tree whenever additional results need to be produced (with guaranteed 
correct state of the iterator tree). Finally, some support is also provided to prevent a 
query from closing its iterator tree that has been stored in a memo table at the end of 
its processing. Details are omitted here due to space constraints. 
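
For the resumable case, Partial Caching can be sketched by storing a suspended producer next to the tokens produced so far. This is a simplified illustration under stated assumptions: a Python generator stands in for the stored iterator tree, and the classes PartialEntry and PartialCacheIterator are hypothetical.

```python
class PartialEntry:
    """Memo table entry: tokens produced so far plus the suspended producer."""

    def __init__(self, producer):
        self.producer = producer    # stands in for the stored iterator tree
        self.tokens = []
        self.done = False

    def token_at(self, i):
        # Resume the producer only as far as needed to reach position i.
        while len(self.tokens) <= i and not self.done:
            try:
                self.tokens.append(next(self.producer))
            except StopIteration:
                self.done = True
        return self.tokens[i] if i < len(self.tokens) else None


class PartialCacheIterator:
    def __init__(self, memo, evaluate):
        self.memo = memo
        self.evaluate = evaluate
        self.entry = None
        self.pos = 0

    def open(self, key):
        if key not in self.memo:
            self.memo[key] = PartialEntry(iter(self.evaluate(key)))
        self.entry = self.memo[key]
        self.pos = 0

    def next(self):
        token = self.entry.token_at(self.pos)
        if token is not None:
            self.pos += 1
        return token


produced = []


def evaluate(key):
    for i in range(1000):
        produced.append(i)          # record work actually performed
        yield f"token-{i}"


memo = {}
a = PartialCacheIterator(memo, evaluate)
b = PartialCacheIterator(memo, evaluate)

a.open("msg-1")
a_first = a.next()                  # consumer A needs one token
after_a = list(produced)
b.open("msg-1")
b_tokens = [b.next(), b.next(), b.next()]   # consumer B needs three
print(after_a, produced)  # [0] [0, 1, 2]
```

The first token is computed once and reused by the second consumer, and only the two additional tokens it requests are produced later.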
6 Performance Evaluation 

We have implemented our techniques for shared XQuery processing in a Java-based 
prototype system extending the BEA XQuery processor. In this section we evaluate 
the effectiveness of these techniques in the context of message brokering, where que- 
ries are executed over XML message payloads (or messages, for short). The work- 
loads are derived from use cases collected from BEA customers. These use cases 
demonstrate common ways of using XQuery in practice. 

Starting from these use cases, we created a set of workload queries based on the 
Purchase Order schema from the Universal Business Language (UBL) library [15]. 
We also created purchase order messages using a tool based on the AlphaWorks 
XML generator [1]. We used a set of 1000 10KB messages in our experiments. The 
performance metric used is Query Execution Time. This is the average time for exe- 
cuting a set of queries on each message from the input set. It does not include the 
message parsing time. We compared the performance of individual execution of the 
queries (“no caching”) and query execution with memoization (“caching”). Complete 
caching and partial caching perform the same, if not otherwise stated. All the experiments 
were performed on a Pentium III 850 MHz processor with 768 MB memory 
running Sun JVM 1.4.2 on Linux 2.4. The JVM maximum allocation pool was set to 
500 MB. All data structures including the memo tables fit in the 500 MB allocation 
pool. 

The first experiment was conducted in the context of subscriptions using param- 
eterized queries, which is a common way that service instances subscribe to a mes- 
sage broker. The parameterized query that we used is provided below: 
for $price in $doc/Order/OrderLine/Item/BasePrice/PriceAmount 
where float(data($price)) lt $value 
return $price 

Thirty queries (i.e., subscriptions) were generated from this template. To obtain dif- 
ferent degrees of query similarity, we varied the number of distinct values that the 
variable $value can take, called the domain size, from 1 to 30. For a given domain 
size n, the values between 1 and n were evenly distributed in the thirty queries. 
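
One way to generate such a workload is sketched below. The round-robin assignment of values and the substitution of a literal for $value are assumptions made for this illustration; the text only states that the values between 1 and n were evenly distributed.

```python
from collections import Counter

# Query template with the parameter written as a format placeholder.
TEMPLATE = (
    "for $price in $doc/Order/OrderLine/Item/BasePrice/PriceAmount "
    "where float(data($price)) lt {value} "
    "return $price"
)


def make_subscriptions(domain_size, count=30):
    """Assign the values 1..domain_size to the queries in round-robin order."""
    return [TEMPLATE.format(value=(i % domain_size) + 1) for i in range(count)]


per_value = Counter(make_subscriptions(3))
print(sorted(per_value.values()))  # [10, 10, 10]
```

With a domain size of 1 all thirty queries are identical, and with a domain size of 30 every query uses a distinct value.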

The results are shown in Fig. 6-1. It is clear that memoization provides huge bene- 
fits for this workload. When all thirty queries use the same value, “caching” achieves 
a 10.7x performance gain. As the domain size increases, its performance benefit decreases 
slowly, obtaining a factor of 3.3 when every query uses a distinct value. 

The next experiment focuses on the use case of message transformation for sub- 
scribers. In this case, the message broker provides a function called summarizeOrder- 
Line (not shown here) for summarizing the line items of interest to each subscriber. 
An example subscription query is illustrated below. 
for $line in $doc/Order/OrderLine 
where xs:integer(data($line/Item/ID)) ge 1 and xs:integer(data($line/Item/ID)) le 20 
return summarizeOrderLine($doc, $line) 
Fig. 6-1. Vary the number of distinct values used for the parameterized query 

Fig. 6-2. Vary the selectivity of the union of the query predicates 
(x-axis: Selectivity of the Union of the Predicates) 


In this test, we used five queries similar to the query above and fixed their selectivity 
to 20% each using a range predicate on the ID of line items. We varied the overlap 
among queries by changing the constants in the range predicates so that the selectivity 
of the union of the 5 predicates varied from 100% (no overlap) to 20% (full overlap). 

With query overlap, the same OrderLine may satisfy multiple predicates, and 
memoization over the function summarizeOrderLine( ) can avoid redundant work 
among queries. This type of sharing, however, cannot be captured by the traditional 
plan-level sharing approach of the relational world [17]. Here, plan sharing is 
equivalent to individual execution of queries. The results of this experiment are 
shown in Fig. 6-2. It can be seen that as the overlap among queries increases, the 
performance of “caching” improves remarkably but that of “no caching” (or plan 
sharing) does not. 

We also conducted experiments to evaluate the effectiveness of our approach for 
web services calls and message routing using path expressions, and to compare com- 
plete and partial caching for workloads where lazy evaluation is crucial. The results 
of these experiments show that significant overall performance gains are available. 



7 Related Work 

Our work is related to a number of areas in the databases and functional programming 
communities. We attempt to provide a rough overview of related work here. 

Query execution techniques based on sorting or hashing have been used to elimi- 
nate redundant computation of expensive methods [10]. For SQL trigger execution, 
algorithms have been developed to extract invariant subqueries from triggers’ condi- 
tion and action [8]. These techniques used in the relational setting are related to ours 
for XQuery processing in the case of single expression, multiple bindings. 

In the pioneering work on multi-query optimization [17], the problem of MQO is 
formulated as “merging” local access plans of a set of queries into a global plan with 
reduced execution cost. Along this line, advanced algorithms have been proposed to 
approximate the optimal global plan [4, 16]. Our work addresses XQuery, a much 
richer language, and uses different techniques for identifying common subexpressions. 
Moreover, as our results show, our work provides finer-grained (i.e., binding-based) 
sharing of intermediate results than merging relational-style access plans. 

There has been a large body of work on materialized views, which precompute 
the views used by a set of queries and store the results to speed up query processing 
[7]. In contrast, our techniques consider expressions with parameters, and cache and 
recover expression results “on the fly” while running a set of queries. View selection 
[1, 3, 14] decides what subqueries to materialize based on cost models and/or reliable 
statistics, similar in spirit to our technique of selecting interesting shareable computa- 
tions. Using partial result caching, our approach has the advantage of avoiding un- 
necessary materialization over the cached computations. 

XQuery memoization differs from memoization in functional programming [11, 
12] in two aspects. First, instead of named functions, the memoization target for 
XQuery is expressions, making effective detection of shareable expressions critical. 
Second, while memoization in functional programming is usually based on a simple 
“values in, value out” execution model, our approach is realized in a complex proc- 
essing model that produces results lazily and pipelines them to the extent possible. 



8 Conclusions 

In this paper, we described a memoization-based approach to sharing in XQuery 
processing. We first provided an analysis of data and expression equivalence for 
XQuery memoization. We then addressed issues related to efficient implementation. 
We developed a number of query compilation techniques to identify interesting 
shareable expressions, and extended a pipelined runtime system with efficient memo 
table lookup and different caching schemes. All of these techniques have been im- 
plemented in the context of a commercial XQuery processor and their effectiveness 
was demonstrated using query workloads derived from the common uses of XQuery 
in practice. 

While our results are promising, we view this as a first step towards solving the 
problem of efficient sharing in XQuery processing. There are many interesting and 
important problems to be addressed in our future work. First, as data and expression 
equivalence is crucial to the applicability of memoization, a thorough analysis in the 
context of the full XQuery language will be a main focus of our work. Second, we 
will consider normalization rules that rewrite queries especially to enable memoiza- 
tion. Such rewriting is aimed at turning variables of an expression from node-based to 
value-based, thus expanding the opportunities for reusing the computed results. Third, 
a complete solution to partial result caching that supports lazy evaluation requires 
further investigation. Finally, due to the diverse characteristics of shareable expres- 
sions, selective memoization is key to the performance of memoization. We plan to 
extend the set of compile-time heuristics and develop runtime adaptability to select 
those shareable expressions beneficial from the cost point of view. 
Abstract. We present a coherent framework for XQuery processing 
that incorporates IR-style approximate matching and allows the ordering 
of results by their relevance score. Our relevance ranking algorithm 
is based on both stem matching and term proximity. Our XQuery pro- 
cessor is stream-based, consisting of iterators connected into pipelines. In 
our framework, all values produced by XQuery expressions are assigned 
scores, and these scores propagate and are combined when piped through 
the iterators. The most important feature of our evaluation engine is the 
use of structural and content-based inverse indexes that deliver data in 
document order and facilitate the use of efficient merge joins to evalu- 
ate path expressions and search predicates. We present the rules for the 
translation from a large part of XQuery to iterator pipeline. Our mod- 
ular approach of building pipelines to evaluate XQuery scales up to any 
query complexity because the pipes can be connected in the same way 
complex queries are formed from simpler ones. 



1 Introduction 

It has been noted since its conception that the structured nature of XML al- 
lows a better representation and more precise querying of text documents. Even 
though current XML query languages, such as XPath and XQuery [5], are very 
powerful in expressing exact queries over XML data, they do not meet the needs 
of the IR community since they do not support approximate matching based on 
textual similarity with relevance ranking. There are already a number of recent 
proposals on extending the XQuery language with IR-style search primitives, 
such as TeXQuery [3,6], XQuery/IR [7], ELIXIR [9], XIRQL [10], XXL [17-19], 
XRANK [13], and XIRCUS [15]. Furthermore, there is an ongoing effort by the 
W3C community to provide usage scenarios for full-text queries [22], but there is 
still no emerging standard for a full-text extension to XQuery. It is not the goal 
of this paper to propose yet another document-centric language for XML docu- 
ments. Instead, we present a framework for XQuery processing in which the most 
important features of these emerging proposals can be seamlessly incorporated 
to the XQuery semantics and can be processed efficiently. 

More specifically, the objectives of this work are: 

1. to use simple but powerful language extensions to XQuery based on current 
proposals to accommodate IR-style approximate matching with relevance 
ranking; these syntactic extensions should require minimal changes to the 
existing XQuery semantics and should be amenable to efficient evaluation; 

2. to design an indexing scheme for XML data for efficient evaluation of both 
search predicates and path navigation; 

3. to build a highly pipelined XQuery engine, called XQueryRank, that evaluates 
XQueries against the indexes using merge joins exclusively and without 
materializing intermediate results; 

4. this engine should consist of iterators that naturally reflect the syntactic 
structures of XQuery and can be composed into pipelines in the same way 
the corresponding XQuery structures are composed to form complex queries; 

5. the XQuery translation should be concise, clean, and completely composi- 
tional, so that optimizations can be easily incorporated later; 

6. even though the prototype system is not intended to be complete, since
XQuery is a very complex language, it should be designed in such a way
that the language features not yet supported can be incorporated later
without strain.

The main contribution of this paper is the development of a coherent frame- 
work for XQuery processing that incorporates IR-style approximate matching 
and allows the ordering of results by their relevance score. Our relevance rank- 
ing algorithm is based on both stem matching and term proximity. Our XQuery 
processor is stream-based, consisting of iterators connected into pipelines [12]. 
This pipeline model avoids materialization of intermediate results when possi- 
ble. In our framework, all values produced by XQuery expressions are assigned 
scores, and these scores propagate and are combined when piped through the 
iterators. The scoring is done implicitly by the stream iterators that evaluate the 
query. The most important feature of our evaluation engine is the use of struc- 
tural and content-based inverse indexes that deliver data in document order and 
facilitate the use of efficient merge joins to evaluate path expressions and search 
predicates. We present the rules for the translation from a large part of XQuery 
(which includes the search extensions) to query plans that represent the iterator 
pipeline. 

The closest work to ours is the TIX algebra [2]. Like the TAX algebra, TIX
works on pattern trees, which capture path expressions over single documents,
even though there is a proposed syntax for incorporating relevance ranking into
XQuery that corresponds to the TIX operators. With XQueryRank, on the other
hand, one may correlate multiple documents in the same query, may query all 
indexed documents at once, and may use any kind of query nesting and com- 
plexity. More importantly, we give the full details of the XQuery translation into 
efficient evaluation plans. Furthermore, while their term matching algorithm is 
limited to conjunctions of search terms, our method is as effective for disjunction 
and negation. Finally, in contrast to our approach, which propagates and com- 
bines relevance scores when fragments are piped through operators, TIX trees 
are annotated with relevance scores, called scored trees. This is a non-functional 
approach that does not work well with the functional semantics of XQuery, because
the same tree may be assigned a different score under a different context.

2 Approximate Matching in XQuery

There are already a number of recent proposals on extending the XQuery lan- 
guage with IR-style search primitives. It is not the main goal of this paper to 
propose yet another document-centric query language; rather, we aim at de- 
veloping a framework in which these query language extensions can be easily 
incorporated into an XQuery processor by providing a language that reflects the 
most important features of these proposals. We are using the following simple
extensions to XQuery that have been influenced by TeXQuery [3,6]:

1. the expression document(), which matches any indexed document and returns
the top-level elements of all these documents. The order of the documents
is unspecified and may depend on the time the documents were indexed.

2. the boolean IR predicate e ~ S, where e is any XQuery expression, which
returns true if at least one element from the sequence returned by e matches
the IR search specification S. The specification S has the following syntax:

S, S1, S2 ::= "phrase"
            | S1 and S2     conjunction
            | S1 or S2      disjunction
            | not S         negation

Note that a phrase must be present in the text of some descendant of the
element e and must consist of consecutive words that do not cross element
boundaries.

3. the function score(e), which, for each element from the XQuery expression e,
returns its score (the relevance assessment). A score is a real number between
0 and 1: a zero score means false, while a non-zero score means true. The closer
the score is to 1, the more relevant the value is to the query result.

For example, the following query

<answer>{
  ( for $db in document()/biblio,
        $b in $db/bib[title ~ ("XQuery processing" and "relevance")]
    where $b/abstract ~ ("SAX" and not "DOM")
    order by score($b) descending
    return <paper>{ $b/author/name, $b/title, score($b) }</paper>
  )[position()<=10]
}</answer>

searches all indexed documents for papers that contain the phrase “XQuery 
processing” and the word “relevance” in the title, and whose abstract contains 
the word “SAX” but not the word “DOM”. The resulting papers are ordered 
by descending order of relevance and the top ten relevant papers are returned. 
The result of the query consists of links to document fragments (i.e., triples with
the document location and the begin/end positions of an element), rather than
the actual text of the fragments, because it is impossible to reconstruct the
original text from the inverse indexes alone. A user may in turn dereference
some of the returned links, assuming that the original XML documents are still 
available. Note also that, if not ordered, the resulting papers will be listed in the 
document order, regardless of scores, as is enforced by the XQuery semantics. 
Since there may be many indexed documents involved, if not ordered, the paper 
listings of each document would appear together and the order of documents 
may depend on the time they were indexed. As expected, scores are propagated 
and combined across path expressions and other XQuery syntactic constructs, so 
that results from any kind of XQuery expression can be assigned scores. Finally,
words in phrases must be contiguous in the XML document and term weighting
is based on both stem matching and term proximity.



2.1 Semantics of the XQuery Extensions 



In our framework, an XML element in an indexed XML document is assigned a
triple (begin,end,level), based on its position in the document (i.e., the positions
of its begin/end tags) and on its depth level [21]. If an element x is a child of an
element y in an XML document, then the (begin,end) region of x is contained
in the corresponding region of y and the level of x is the level of y plus one.
Furthermore, a preceding sibling of an element has a preceding, non-overlapping 
region. This numbering scheme, called region encoding [21], has already been 
used by many systems for effective XML indexing and efficient path expression 
evaluation. 
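
The containment rules above reduce to a few integer comparisons. The following Java sketch illustrates them; the class RegionEncoding, its Region type, and the sample positions are hypothetical names and values chosen for illustration and are not taken from the XQueryRank sources.

```java
// Illustrative sketch of region-encoding tests (names and data are assumptions).
final class RegionEncoding {
    /** An element's (begin, end, level) triple within one document. */
    static final class Region {
        final int begin, end, level;
        Region(int begin, int end, int level) {
            this.begin = begin; this.end = end; this.level = level;
        }
    }

    /** True if the region of x lies strictly inside the region of y. */
    static boolean isDescendant(Region x, Region y) {
        return y.begin < x.begin && x.end < y.end;
    }

    /** True if x is contained in y and is exactly one level deeper. */
    static boolean isChild(Region x, Region y) {
        return isDescendant(x, y) && x.level == y.level + 1;
    }

    /** True if x ends before y begins, so the two regions do not overlap. */
    static boolean precedes(Region x, Region y) {
        return x.end < y.begin;
    }

    public static void main(String[] args) {
        // <bib><title>..</title><author><name>..</name></author></bib>
        Region bib = new Region(0, 9, 0);
        Region title = new Region(1, 3, 1);
        Region author = new Region(4, 8, 1);
        Region name = new Region(5, 7, 2);
        System.out.println(isChild(title, bib));      // true
        System.out.println(isChild(name, bib));       // false: name is a grandchild
        System.out.println(isDescendant(name, bib));  // true
        System.out.println(precedes(title, author));  // true
    }
}
```

Because every test is a constant-time comparison on the triple, structural relationships can be decided from the index alone, without touching the original document.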

An XML fragment (called XF ) is an XML element from an indexed document 
that may contain text. Text is tokenized into terms (words) that reflect the 
linguistic tokens of a given human language. A term in an XF is assigned a pair 
(position, level) so that the position of the term is contained by the XF’s region 
and the position order reflects the term order in the XF. The level of a term is 
equal to the level of the enclosing XF. In addition, to facilitate phrase matching 
and proximity calculations, our numbering scheme guarantees that consecutive 
terms be assigned succeeding positions. 

Since the unit of XQuery processing is a sequence of XML elements, the
unit of full-text searching should be a sequence of XF elements, rather than the
entire document. A term in an XF can be assigned a weight $\mathcal{W}[\mathit{term}]$ based on
the standard IR term-frequency/inverse-document-frequency (tf-idf) relevance
ranking. We are using the following weight:

\[
\mathcal{W}[\mathit{term}] =
\frac{\text{frequency-of-term-in-XF}}{\text{number-of-terms-in-XF}}
\times \frac{\text{e-level}}{\text{term-level} - \text{e-level} + 1}
\times \log\left(\frac{\text{total-number-of-XFs}}{\text{XFs-containing-term}}\right)
\]

where e-level is the level of the element e being tested for matching. The first
factor is the frequency of the term in the XF, the second factor depends on how
close the level of the term is to that of the element under consideration, and
the last factor depends on how rare the term is in the database (rare terms are
given higher weight). The boolean connectives in a search specification can be
assigned a score based on the probabilistic interpretation of weights:

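\[
\begin{aligned}
\mathcal{W}[S_1 \text{ and } S_2] &= \mathcal{W}[S_1] \times \mathcal{W}[S_2] \\
\mathcal{W}[S_1 \text{ or } S_2] &= \mathcal{W}[S_1] + \mathcal{W}[S_2] - \mathcal{W}[S_1] \times \mathcal{W}[S_2] \\
\mathcal{W}[\text{not } S] &= 1 - \mathcal{W}[S]
\end{aligned}
\]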

As can easily be proven, this interpretation preserves almost all the boolean
laws, such as $\mathcal{W}[\text{not}(S_1 \text{ and } S_2)] = \mathcal{W}[(\text{not } S_1) \text{ or } (\text{not } S_2)]$
and $\mathcal{W}[\text{not}(\text{not } S)] = \mathcal{W}[S]$, but does not preserve the idempotent laws,
such as $(S \text{ and } S) = S$. Unfortunately, this probabilistic interpretation is not
very useful, since it does not take term proximity into account; the following
interpretation does.
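
Before moving on, the probabilistic rules can be checked numerically. The Java sketch below is only an illustration: the class name ProbabilisticWeights and the sample weights 0.5 and 0.25 are assumptions made for the example.

```java
// Illustrative check of the probabilistic interpretation of and/or/not.
final class ProbabilisticWeights {
    static double and(double w1, double w2) { return w1 * w2; }
    static double or(double w1, double w2)  { return w1 + w2 - w1 * w2; }
    static double not(double w)             { return 1.0 - w; }

    public static void main(String[] args) {
        double w1 = 0.5, w2 = 0.25;
        // De Morgan: not(S1 and S2) has the same weight as (not S1) or (not S2)
        System.out.println(not(and(w1, w2)));      // 0.875
        System.out.println(or(not(w1), not(w2)));  // 0.875
        // Double negation is preserved
        System.out.println(not(not(w1)));          // 0.5
        // Idempotence is not: (S and S) weighs 0.25, while S weighs 0.5
        System.out.println(and(w1, w1));           // 0.25
    }
}
```

The weights depend only on which terms occur, not on where they occur, which is exactly the limitation noted above.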

Consider a conjunction of n terms tested against an XF

"$T_1$" and ... and "$T_n$"

where some instance of the term "$T_i$" is located at position $p_i$ in the XF and
has weight $w_i$. We abstract this conjunction with the quadruple

\[ (\min(p_1, \ldots, p_n),\ \max(p_1, \ldots, p_n),\ n,\ w_1 \times \cdots \times w_n) \]

That is, instead of remembering all term positions, we abstract them with the
smallest interval that contains these positions and assume that these n positions
are distributed uniformly along this interval. In general, a conjunction of terms
is represented by the quadruple $(b, e, n, w)$ (which stand for begin-position,
end-position, number-of-points, and weight). It is assigned the score

\[ \left(1 - \frac{e - b}{n \times \mathit{size}}\right) \times w \]

where 'size' is the XF size (the total number of terms in all descendants of the
XF). The first score factor is proportional to the term proximity since $(e - b)/n$
is the average distance between terms.

To make this idea work for any type of boolean connective in a search predicate
S, we associate two sets with S: one, $\mathcal{W}[S].\mathcal{T}$, that contains the positive
quadruples and another, $\mathcal{W}[S].\mathcal{F}$, that contains the negative quadruples. Since
a quadruple abstracts a conjunction of terms, $\mathcal{W}[S].\mathcal{T}$ is the disjunction of all
these conjunctions. If there are no negative terms, then the score of a search
predicate S is:

\[
\mathrm{score}(S) = \oplus/\left\{ \left(1 - \frac{e - b}{n \times \mathit{size}}\right) \times w \;\middle|\; (b, e, n, w) \in \mathcal{W}[S].\mathcal{T} \right\}
\]

where $\{\, e \mid \ldots \,\}$ is a set former notation, much like the tuple relational calculus,
the binary associative function $\oplus$ gives the weight of a disjunction:

\[ w_1 \oplus w_2 = w_1 + w_2 - w_1 \times w_2 \]

and $\oplus/\{w_1, \ldots, w_n\}$ reduces the set $\{w_1, \ldots, w_n\}$ by $\oplus$ as follows:

\[ \oplus/\{w_1, w_2, \ldots, w_n\} = w_1 \oplus w_2 \oplus \cdots \oplus w_n \]

That is, each contribution is proportional to both the term weight and the proximity
between the constituent terms, and the total score is accumulated by repeatedly
applying the weighting function for disjunctions.
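
The accumulation of positive quadruples can be sketched in Java as follows. This is a minimal sketch under a stated assumption: the per-quadruple score is taken to be (1 - (e - b)/(n x size)) x w, with (e - b)/n the average distance between terms. The class QuadrupleScore, the Quad type, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative scoring of positive quadruples (formula and names are assumptions).
final class QuadrupleScore {
    /** (begin, end, number-of-points, weight) abstraction of a conjunction of terms. */
    static final class Quad {
        final int b, e, n;
        final double w;
        Quad(int b, int e, int n, double w) { this.b = b; this.e = e; this.n = n; this.w = w; }
    }

    /** Assumed per-quadruple score: proximity factor times weight. */
    static double quadScore(Quad q, int size) {
        double proximity = 1.0 - (double) (q.e - q.b) / ((double) q.n * size);
        return proximity * q.w;
    }

    /** Disjunction weighting: w1 (+) w2 = w1 + w2 - w1*w2. */
    static double oplus(double w1, double w2) { return w1 + w2 - w1 * w2; }

    /** Reduces the scores of all positive quadruples with (+); an empty set scores 0. */
    static double totalScore(List<Quad> positives, int size) {
        double total = 0.0;  // 0 is the identity of (+)
        for (Quad q : positives) total = oplus(total, quadScore(q, size));
        return total;
    }

    public static void main(String[] args) {
        int size = 64;                          // terms in the XF
        Quad single = new Quad(10, 10, 1, 0.5); // one term occurrence
        Quad spread = new Quad(20, 52, 2, 0.5); // two terms, 32 positions apart
        System.out.println(quadScore(single, size));                         // 0.5
        System.out.println(quadScore(spread, size));                         // 0.375
        System.out.println(totalScore(Arrays.asList(single, spread), size)); // 0.6875
    }
}
```

Starting the reduction at 0 works because 0 is the identity of the disjunction weighting function, so a predicate with no positive quadruples scores zero.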
Before we give the general score formula for any search predicate, we describe
the rules for the boolean connectives. Conjunctions are formed by merging
quadruples using the following operator:

\[
\begin{aligned}
A \bowtie \emptyset = \emptyset \bowtie A &= \{\, (b, e, n, w) \mid (b, e, n, w) \in A \,\} \\
A_1 \bowtie A_2 &= \{\, (\min(b_1, b_2),\ \max(e_1, e_2),\ n_1 + n_2,\ w_1 \times w_2) \\
&\qquad \mid (b_1, e_1, n_1, w_1) \in A_1 \wedge (b_2, e_2, n_2, w_2) \in A_2 \,\}
\end{aligned}
\]

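Then, the search specifications are translated as follows:

\[
\begin{aligned}
\mathcal{W}[S_1 \text{ and } S_2].\mathcal{T} &= \mathcal{W}[S_1].\mathcal{T} \bowtie \mathcal{W}[S_2].\mathcal{T} &
\mathcal{W}[S_1 \text{ and } S_2].\mathcal{F} &= \mathcal{W}[S_1].\mathcal{F} \cup \mathcal{W}[S_2].\mathcal{F} \\
\mathcal{W}[S_1 \text{ or } S_2].\mathcal{T} &= \mathcal{W}[S_1].\mathcal{T} \cup \mathcal{W}[S_2].\mathcal{T} &
\mathcal{W}[S_1 \text{ or } S_2].\mathcal{F} &= \mathcal{W}[S_1].\mathcal{F} \bowtie \mathcal{W}[S_2].\mathcal{F} \\
\mathcal{W}[\text{not } S].\mathcal{T} &= \mathcal{W}[S].\mathcal{F} &
\mathcal{W}[\text{not } S].\mathcal{F} &= \mathcal{W}[S].\mathcal{T}
\end{aligned}
\]

This interpretation too preserves almost all boolean laws, as can be seen from
the following case:

\[
\begin{aligned}
\mathcal{W}[\text{not}(S_1 \text{ and } S_2)].\mathcal{T} &= \mathcal{W}[S_1 \text{ and } S_2].\mathcal{F} = \mathcal{W}[S_1].\mathcal{F} \cup \mathcal{W}[S_2].\mathcal{F} \\
&= \mathcal{W}[\text{not } S_1].\mathcal{T} \cup \mathcal{W}[\text{not } S_2].\mathcal{T} = \mathcal{W}[(\text{not } S_1) \text{ or } (\text{not } S_2)].\mathcal{T}
\end{aligned}
\]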
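
The translation of the connectives into positive and negative sets can be sketched in Java as follows. The sketch assumes that joining with an empty set leaves the other operand unchanged; the class SearchSets, its nested types, and the sample weights and positions are hypothetical and serve only as an illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative translation of and/or/not into positive (t) and negative (f) sets.
final class SearchSets {
    static final class Quad {
        final int b, e, n;
        final double w;
        Quad(int b, int e, int n, double w) { this.b = b; this.e = e; this.n = n; this.w = w; }
    }

    /** The positive and negative quadruple sets of a search specification. */
    static final class Sets {
        final List<Quad> t, f;
        Sets(List<Quad> t, List<Quad> f) { this.t = t; this.f = f; }
    }

    /** Merges every pair of quadruples; an empty operand is assumed to be neutral. */
    static List<Quad> join(List<Quad> a1, List<Quad> a2) {
        if (a1.isEmpty()) return a2;
        if (a2.isEmpty()) return a1;
        List<Quad> out = new ArrayList<Quad>();
        for (Quad x : a1)
            for (Quad y : a2)
                out.add(new Quad(Math.min(x.b, y.b), Math.max(x.e, y.e), x.n + y.n, x.w * y.w));
        return out;
    }

    static List<Quad> union(List<Quad> a1, List<Quad> a2) {
        List<Quad> out = new ArrayList<Quad>(a1);
        out.addAll(a2);
        return out;
    }

    /** A term with weight w occurring at the given positions; its negative set is empty. */
    static Sets term(double w, int... positions) {
        List<Quad> t = new ArrayList<Quad>();
        for (int p : positions) t.add(new Quad(p, p, 1, w));
        return new Sets(t, new ArrayList<Quad>());
    }

    static Sets and(Sets s1, Sets s2) { return new Sets(join(s1.t, s2.t), union(s1.f, s2.f)); }
    static Sets or(Sets s1, Sets s2)  { return new Sets(union(s1.t, s2.t), join(s1.f, s2.f)); }
    static Sets not(Sets s)           { return new Sets(s.f, s.t); }

    public static void main(String[] args) {
        Sets sax = term(0.5, 3, 10);
        Sets dom = term(0.25, 7);
        Sets s = and(sax, not(dom));   // "SAX" and not "DOM"
        System.out.println(s.t.size() + " positive, " + s.f.size() + " negative");  // 2 positive, 1 negative
    }
}
```

Negation only swaps the two sets, which is why the De Morgan case shown above holds by construction.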

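Furthermore, a term "T" with weight w that appears at the positions $p_1, \ldots, p_n$,
for $n > 0$, in an XF has $\mathcal{W}[\text{``T''}].\mathcal{F} = \emptyset$ and:

\[ \mathcal{W}[\text{``T''}].\mathcal{T} = \{ (p_1, p_1, 1, w), \ldots, (p_n, p_n, 1, w) \} \]

In addition, a phrase "$T_1\ T_2 \ldots T_n$" has the following $\mathcal{W}[\text{``}T_1\ T_2 \ldots T_n\text{''}].\mathcal{T}$:

\[
\begin{aligned}
\{\, (p_1, p_n, n, w_1 \times \cdots \times w_n) \mid\ & (p_1, p_1, 1, w_1) \in \mathcal{W}[\text{``}T_1\text{''}].\mathcal{T} \\
& \wedge \ldots \wedge (p_n, p_n, 1, w_n) \in \mathcal{W}[\text{``}T_n\text{''}].\mathcal{T} \\
& \wedge\ p_2 = p_1 + 1 \wedge \ldots \wedge p_n = p_{n-1} + 1 \,\}
\end{aligned}
\]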



That is, terms in a phrase must have succeeding positions. Finally, the total
score of any search predicate S, for a non-empty $\mathcal{W}[S].\mathcal{F}$, is given by:

\[
\begin{aligned}
\oplus/\Big\{\ & |\min(e_1, e_2) - \max(b_1, b_2)|
\times \Big(1 - \frac{e_1 - b_1}{n_1 \times \mathit{size}}\Big) \times w_1
\times \Big(1 - \Big(1 - \frac{e_2 - b_2}{n_2 \times \mathit{size}}\Big) \times w_2\Big) \\
& \Big|\ (b_1, e_1, n_1, w_1) \in \mathcal{W}[S].\mathcal{T} \wedge (b_2, e_2, n_2, w_2) \in \mathcal{W}[S].\mathcal{F} \ \Big\}
\end{aligned}
\]

The motivation behind this formula is that negative terms should be as far
as possible from positive terms. This requirement is materialized by the factor
$|\min(e_1, e_2) - \max(b_1, b_2)|$, which is proportional to the distance between the
positive and the negative quadruples.

In our framework, all values produced by XQuery expressions are assigned
scores. This is done implicitly by the stream iterators that evaluate the query,
which are described in Section 4. The unit of communication between iterators
is a tuple that contains a number of elements, which typically correspond
to document fragments. Each fragment is associated with a score, and these
scores propagate and are combined when piped through the iterators. For example,
a predicate in a path expression uses the disjunction weighting function, $\oplus$, to
combine the predicate scores. If the resulting score is zero, the path fragment
is discarded; otherwise, its score is multiplied by the resulting predicate score.
Therefore, the score of consecutive predicates in a path is the product of their
constituent scores.

3 The Inverse XML Indexes

Our system uses four inverse indexes: one for XML tags, one for text terms, one
for attribute names, and one for attribute values. The design of these indexes
was done in a way that facilitates efficient query processing (using merge joins
exclusively). Each inverse index consists of one table, called documents, which is
common to all indexes, and three other tables that constitute the inverse index
(the code is given in Java):

class InverseIndex {
    Document[] documents;
    Key[] keys;
    Posting[] postings;
    Hit[] hits; }

The vector documents is a binary search vector that contains the URLs of the 
indexed XML documents: 

class Document { String url ; } 

Each XML document URL appears once in the documents vector. The keys vector 
is a binary search vector that implements the index dictionary: 

class Key {
    String key;
    int fragments;        // how many fragments contain this term
    int total_frequency;  // times term appears in all documents
    int first_posting; }  // location of the first posting

Each term/tag must appear once in the dictionary. That is, for all i, j:
i < j implies keys[i].key < keys[j].key. For every XML document that contains
a key in keys, there is a unique entry in postings, and all these entries appear
in succeeding positions ordered by document number, starting at the first_posting
position of the key:

class Posting {
    int document;     // document location
    int frequency;    // how many times the tag appears in document
    int first_hit; }  // location of the first hit

That is, all postings between keys[i].first_posting and keys[i+1].first_posting
are associated with the key keys[i].key and are ordered by document number.
Each posting, postings[i], is associated with postings[i].frequency hits
in the hits vector, starting from postings[i].first_hit, which are ordered by the
begin/position attribute. The Hit element structure depends on the type of the
index. For the tag index, it is:

class TagHit extends Hit {
    short begin;      // the start position of the element in document
    short end;        // the end position of the element in document
    short level;      // depth of the element in document
    short ordinal; }  // ordinal of element within parent

while for the term index it is:

class TermHit extends Hit {
    short position;  // the start position of term in document
    short level; }   // depth of term in document

The iterator used for accessing the inverse indexes delivers the Posting/Hit
pairs sorted by (document_number, begin/position) order, called the index order.
That is, the major order is the document number and the minor order is the begin
position of a TagHit or the position of a TermHit. This is a crucial property
because it makes possible the implementation of both path expressions and IR
search specifications using merge joins without the need to sort the input
first. Furthermore, this order is actually the document order for each document.
The IndexIterator for an inverse index provides the method open(key) to open the
stream for accessing key hits, and the method next() to get the next Posting/Hit
from the stream:

class IndexIterator {
    InverseIndex index;
    int ck;           // current key
    int cp;           // current posting
    int ch;           // current hit
    int max_posting;  // one past the last posting of the current key
    boolean eos;      // is this the end of stream?

    void open ( String key ) {
        ck = binary_search(index.keys, key);
        eos = !key.equals(index.keys[ck].key);
        cp = index.keys[ck].first_posting;
        ch = index.postings[cp].first_hit;
        // the postings of key ck end where those of key ck+1 begin
        max_posting = (ck+1 < index.keys.length)
                      ? index.keys[ck+1].first_posting
                      : index.postings.length;
    }

    void next () {
        // was this the last hit of the current posting?
        if (ch >= index.postings[cp].first_hit
                  + index.postings[cp].frequency - 1)
            if (++cp < max_posting)
                ch = index.postings[cp].first_hit;
            else eos = true;
        else ch++;
    }

    boolean eos () { return eos; } }
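
To illustrate how the three tables fit together, the following Java sketch lays out a tiny index as parallel arrays and walks the hits of one key in index order. The array names and the sample data are invented for this example, and the end of a key's postings is taken, as an assumption, from the first posting of the next key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative layout of the keys/postings/hits tables (data is made up).
final class TinyIndex {
    static final String[] KEY           = {"author", "title"};  // sorted dictionary
    static final int[]    FIRST_POSTING = {0, 2};               // per key
    static final int[]    POSTING_DOC   = {0, 1, 0};            // document number per posting
    static final int[]    POSTING_FREQ  = {2, 1, 1};            // hits per posting
    static final int[]    FIRST_HIT     = {0, 2, 3};            // per posting
    static final int[]    HIT_BEGIN     = {4, 9, 2, 1};         // begin position per hit

    /** Returns "document:begin" for every hit of the key, in index order. */
    static List<String> hits(String key) {
        List<String> out = new ArrayList<String>();
        int ck = Arrays.binarySearch(KEY, key);
        if (ck < 0) return out;  // key not in the dictionary
        int maxPosting = ck + 1 < KEY.length ? FIRST_POSTING[ck + 1] : POSTING_DOC.length;
        for (int cp = FIRST_POSTING[ck]; cp < maxPosting; cp++)
            for (int ch = FIRST_HIT[cp]; ch < FIRST_HIT[cp] + POSTING_FREQ[cp]; ch++)
                out.add(POSTING_DOC[cp] + ":" + HIT_BEGIN[ch]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hits("author"));    // [0:4, 0:9, 1:2]
        System.out.println(hits("title"));     // [0:1]
        System.out.println(hits("abstract"));  // []
    }
}
```

Since postings are ordered by document and hits by begin position, the output is already sorted by (document, begin), which is what the merge joins rely on.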

Our system populates the indexes by parsing XML documents using SAX. The 
text in XF elements is tokenized, stopwords are eliminated, and each token 
is stemmed using Porter’s algorithm [16]. The SAX events startElement and 
endElement, and each token in text (from the SAX event characters) are assigned 
succeeding positions that correspond to document order. Currently, the indexes 
reside entirely in memory but they can be dumped to and read from binary files. 
We leave the implementation of indexes using B+-trees for future work.
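
The position assignment during parsing can be sketched with the standard Java SAX API as follows. This is a simplified illustration, not the XQueryRank indexer: the class PositionAssigner and its output format are hypothetical, tokens are split on whitespace and lower-cased, and stopword elimination and Porter stemming are omitted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative SAX handler that assigns succeeding positions to tags and tokens.
final class PositionAssigner extends DefaultHandler {
    /** Collected hits: "tag name begin end level" or "term token position level". */
    final List<String> hits = new ArrayList<String>();
    private final Deque<Integer> begins = new ArrayDeque<Integer>();  // begin positions of open elements
    private final StringBuilder text = new StringBuilder();           // pending character data
    private int position = 0;                                         // next free position

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        flushText();
        begins.push(position++);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        flushText();
        int begin = begins.pop();
        int level = begins.size();  // number of enclosing elements; the root has level 0
        hits.add("tag " + qName + " " + begin + " " + (position++) + " " + level);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);  // SAX may deliver text in several chunks
    }

    /** Tokenizes pending text on whitespace; each token takes the next position. */
    private void flushText() {
        for (String token : text.toString().trim().split("\\s+"))
            if (!token.isEmpty())
                hits.add("term " + token.toLowerCase(Locale.ROOT) + " " + (position++)
                         + " " + (begins.size() - 1));
        text.setLength(0);
    }

    /** Parses the XML text and returns the hits in the order they were completed. */
    static List<String> index(String xml) {
        try {
            PositionAssigner handler = new PositionAssigner();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.hits;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String hit : index("<bib><title>XQuery processing</title></bib>"))
            System.out.println(hit);
    }
}
```

Tag hits are emitted when the end tag is seen, so they come out ordered by end position; an indexer built along these lines would still have to store them by begin position to obtain the index order.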
4 The XQuery Processor



Our XQuery processor is stream-based, consisting of iterators that read data
from input streams and deliver data to the output stream. This is done in a
pull-based fashion, in which iterators are connected through pipelines and an
iterator (the producer) delivers a stream element to the output only when
requested by the next operator in the pipeline (the consumer). To deliver one stream
element to the output, the producer becomes a consumer by requesting from the
previous iterator as many elements as necessary to produce a single element, and
so on, until the end of stream. This pipeline model is very popular in database query
processing [12] because it avoids materialization of intermediate results when
possible (it is not possible for blocking operations, such as sorting and grouping).

The stream unit in our framework is a fragment retrieved from the inverse 
indexes. Because of the complexity of XQuery, though, XML elements may be 
constructed on the fly and operated upon by path expressions or other XQuery 
operations. It is essential, therefore, to support fragments from indexed docu- 
ments as well as XML elements constructed on the fly. They are all subclasses 
of Element: 



abstract class Element { 

float score; } // relevance assessment of element 



Then, a fragment describes an element from an indexed document: 



class Fragment extends Element {
    int document;    // document ID
    short begin;     // the start position in document
    short end;       // the end position in document
    short level; }   // depth of element in document



An XML element constructed on the fly is similar to a DOM element: 

class ConstructedElement extends Element {
    String tagname;
    Element[] sequence;        // children
    Attributes attributes; }   // SAX-like attributes

class PCData extends Element { String data; }



The support of constructed XML elements allows us to process XML documents
in their native form (not indexed) using the document(url) XQuery construct.

Finally, to support queries on indexed documents effectively when we do
not have an index key to search, such as in the query count(document()/*/*),
which counts all indexed elements of level 2, we support patterns that match all
indexed fragments within a given depth region:

class Pattern extends Element {
    int min_level;     // minimum depth in document
    int max_level; }   // maximum depth in document



In fact, the document() expression is translated into an iterator that produces
only one element: namely, Pattern(0,0), which matches all top-level indexed
elements. Later, other iterators may convert this element into a stream of
Fragments, when more information is provided (such as after a tagged projection).

Since XQuery supports FLWOR expressions with variables, all variable bindings
of for-clauses are concatenated into a tuple. Recall that the values of
for-clauses are always single elements, while the values of let-clauses may actually
be sequences. We will see how let-variables are treated later. Thus, the unit of
communication between iterators is a Tuple defined as:

class Tuple { Element[] components; }

All iterators are objects that are instances of classes that are subclasses of the 
class Iterator: 

class Iterator { 

Tuple current; // current tuple from stream 

void open (); // open the stream iterator 

Tuple next (); // get the next tuple from stream 

boolean eos (); } // is this the end of stream? 

One iterator that uses the tag index is Child(tag, input):

class Child extends Iterator {
    String tag;
    Iterator input;
    IndexIterator ti; }

where ti=tag_index.open(tag) is the iterator over the tag index. The body of the
next() method below merges the input stream with the stream produced by the
tag index. This is possible because almost all operators preserve the index order
(the only exception is the order-by in FLWOR, which destroys the order and,
thus, requires reordering if processed further).

Tuple next () {
    while (!ti.eos() && !input.eos()) {
        if (input.current[0] instanceof Fragment) {
            Fragment lf = (Fragment) input.current[0];
            Key k = ti.key();
            Posting p = ti.posting();
            TagHit h = (TagHit) ti.hit();
            if (lf.document < p.document)
                input.next();
            else if (lf.document > p.document)
                ti.next();
            else if (lf.begin < h.begin && lf.end > h.end
                     && h.level == lf.level+1) {
                // h is a child of lf: same document, nested region, one level deeper
                current = new Tuple(new Fragment(p.document,
                                                 h.begin, h.end, h.level));
                ti.next();
                return current;
            } else if (lf.begin < h.begin)
                input.next();
            else ti.next(); ...

This method works for non-recursive schemas only. A more general algorithm
that can handle recursive schemas would require a stack of ancestors (see, for
example, the stack-tree algorithm [1] and the holistic twig join algorithm [8]). For
example, the XQuery document("cs.xml")/department/gradstudent/name is
translated into the following pipeline of iterators:

new Child("name", new Child("gradstudent", new Child("department",
                                                      new Document("cs.xml"))))

where the Document iterator delivers a single ConstructedElement object, namely 
the top-level XML element of the document, after the entire document has been 
read and stored in memory. 
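
The merge performed by Child can be tried out on in-memory lists with the following Java sketch. It is a variation rather than the exact branch order of the next() method: the parent is advanced only once the hit lies past the parent's end, so hits nested deeper than one level are skipped instead of ending the parent early. The class ChildMergeJoin, the Frag type, and the sample regions are hypothetical, and parents are assumed not to nest (a non-recursive schema).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative merge join of parent fragments with tag hits, both in index order.
final class ChildMergeJoin {
    static final class Frag {
        final int document, begin, end, level;
        Frag(int document, int begin, int end, int level) {
            this.document = document; this.begin = begin; this.end = end; this.level = level;
        }
        @Override public String toString() { return document + ":" + begin + "-" + end; }
    }

    /** Returns the hits that are children of some parent, preserving index order. */
    static List<Frag> children(List<Frag> parents, List<Frag> hits) {
        List<Frag> out = new ArrayList<Frag>();
        int i = 0, j = 0;
        while (i < parents.size() && j < hits.size()) {
            Frag lf = parents.get(i), h = hits.get(j);
            if (lf.document < h.document) i++;
            else if (lf.document > h.document) j++;
            else if (lf.begin < h.begin && h.end < lf.end && h.level == lf.level + 1) {
                out.add(h);  // h is a child of lf
                j++;
            }
            else if (h.begin > lf.end) i++;  // the hit lies after this parent
            else j++;                        // the hit precedes the parent or is nested deeper
        }
        return out;
    }

    public static void main(String[] args) {
        List<Frag> gradstudents = Arrays.asList(
            new Frag(0, 1, 8, 1), new Frag(0, 9, 16, 1), new Frag(1, 1, 6, 1));
        List<Frag> names = Arrays.asList(
            new Frag(0, 2, 4, 2), new Frag(0, 10, 12, 2),
            new Frag(0, 17, 19, 1),   // a name outside any gradstudent
            new Frag(1, 2, 4, 2));
        System.out.println(children(gradstudents, names));  // [0:2-4, 0:10-12, 1:2-4]
    }
}
```

Each list is scanned once, so the join costs time linear in the two inputs and needs no sorting, provided both arrive in index order.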

For-loops in a FLWOR expression are evaluated using the Loop and Step 
iterators. For example, the FLWOR expression 

for $d in document()/department, $s in $d/gradstudent

is translated into the pipeline:

Iterator s = new Step();
new Loop(new Child("department", new Step()),
         s,
         new Child("gradstudent", new Select(0, s)));

The Select iterator returns a singleton tuple containing the nth element of the 
tuple. It basically accesses a for-variable. Our FLWOR evaluation scheme com- 
pletely avoids materialization of intermediate results. Basically, the Step iterator 
delivers one element only and then signals the end of stream: 

class Step extends Iterator {
    boolean first;
    Tuple tuple = new Tuple(new Pattern(0, 0));
    void open () { first = true; current = tuple; }
    Tuple next () { first = false; return current; }
    void set ( Tuple t ) { tuple = t; }
    boolean eos () { return !first; } }

Now, the Loop iterator, which is defined as: 

class Loop extends Iterator {
    Iterator left;
    Step right_step;
    Iterator right; }

processes the left stream one tuple at a time, and sets the current tuple of the 
Step iterator. Then it opens the right stream and processes it one tuple at a time, 
up to the end of stream. Then it repeats this process for the next left tuple, etc, 
until the end of the left stream. More specifically, the next method of Loop is: 

Tuple next () {
    if (!left.eos()) {
        while (right.eos()) {
            left.next();
            right_step.set(left.current);
            right.open(); };
        current = left.current.append(right.current);
        right.next();
        return current; } }

The same technique is used in evaluating predicates in path expressions and in 
FLWOR expressions. For example, the path expression: 

document()/department/gradstudent[gpa/text() = "3.5"]/name

is translated into:

Iterator s = new Step();
new Child("name",
          new Predicate(new Child("gradstudent",
                                  new Child("department", new Step())),
                        s,
                        new Call("eq", new Text(new Child("gpa", s)),
                                 new Constant("3.5"))))

where the Predicate iterator operates like Loop, by reopening and processing the
condition stream for each tuple in the left stream. Two other iterators, Return
and Sort, use the same stepper to evaluate an expression (the return expression
of Return and the sorting values of Sort) for each input tuple.

The let-bindings in FLWOR expressions are the hardest to implement because
each variable may be bound to a sequence (i.e., a stream of tuples), rather
than one tuple. Of course, one may materialize the bound stream into a vector,
but this would be infeasible for large sequences. In our implementation, a
let-variable is bound to an Iterator that delivers the sequence of tuples, but
the Iterator stream is cloned as many times as the let-variable is accessed.
The cloning is done using a queue with multiple pointers (one for each
instance of the let-variable). The size of the queue depends on the backlog
between the fastest and the slowest consumer. In some extreme cases, such as
let $v:=e return ($v,$v), it is the size of the entire stream (since concatenation
waits for the end of the left stream before opening the right stream).
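
The queue with multiple pointers can be sketched in Java as follows. This is a minimal illustration rather than the XQueryRank code: the class ClonedStream, its methods, and the use of an ArrayList as the shared buffer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative shared buffer that lets several clones read one stream.
final class ClonedStream<T> {
    private final Iterator<T> source;
    private final List<T> queue = new ArrayList<T>();  // items not yet read by every clone
    private int base = 0;        // stream offset of queue.get(0)
    private final int[] cursor;  // next stream offset for each clone

    ClonedStream(Iterator<T> source, int clones) {
        this.source = source;
        this.cursor = new int[clones];
    }

    boolean hasNext(int clone) {
        return cursor[clone] < base + queue.size() || source.hasNext();
    }

    T next(int clone) {
        if (cursor[clone] == base + queue.size())
            queue.add(source.next());  // this clone has consumed everything buffered so far
        T item = queue.get(cursor[clone] - base);
        cursor[clone]++;
        trim();
        return item;
    }

    /** Drops the items that every clone has already read. */
    private void trim() {
        int slowest = cursor[0];
        for (int c : cursor) slowest = Math.min(slowest, c);
        queue.subList(0, slowest - base).clear();
        base = slowest;
    }

    /** Number of buffered items, i.e. the gap between the fastest and slowest clone. */
    int backlog() { return queue.size(); }

    public static void main(String[] args) {
        ClonedStream<Integer> s =
            new ClonedStream<Integer>(Arrays.asList(1, 2, 3, 4).iterator(), 2);
        s.next(0); s.next(0); s.next(0);  // clone 0 runs ahead
        System.out.println(s.backlog());  // 3
        System.out.println(s.next(1));    // 1
        System.out.println(s.backlog());  // 2
    }
}
```

When one clone reads the whole stream before the other starts, as in the concatenation case mentioned above, the backlog grows to the full stream length.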

5 XQuery Translation 

Figure 1 gives the rules for the translation. An XQuery e is translated into the
plan T(⟦e⟧, Step()), where the semantic brackets (⟦ ⟧) enclose XQuery syntax
(i.e., they represent the abstract syntax tree associated with the syntax). Basically,
T(⟦e⟧, c) translates the XQuery syntax e into an iterator (the consumer)
that receives stream data from the iterator c (the producer). Function calls,
f(a1, ..., an), include binary operations, such as +, *, and, =, <, if-then-else,
etc. Element construction <tag> ... </tag> is equivalent to a call to
element(tag, e), where e is the concatenation of the element components (using
the comma operator). Function F(⟦FL⟧, c, s) translates a sequence of for/let
bindings FL in a FLWOR expression from left to right. The parameter s is the
stepper used by the Return iterator of the FLWOR expression.

Translation of XQuery expressions: e, e1, ..., en

T(⟦"text"⟧, c)                       = Constant("text", c)
T(⟦$v⟧, c)                           = Select(v, c)   (a For-variable)
T(⟦$v⟧, c)                           = LetVar(v, k)   (a Let-variable that uses the kth
                                                       copy of the cloned Let-binding)
T(⟦path⟧, c)                         = P(⟦path⟧, c)
T(⟦document()⟧, c)                   = Step()
T(⟦document(url)⟧, c)                = Document(url)
T(⟦element(tag, attrs, e)⟧, c)       = Element(tag, attrs, T(⟦e⟧, c))
T(⟦score(e)⟧, c)                     = Score(T(⟦e⟧, c))
T(⟦e ~ S⟧, c)                        = R(⟦S⟧, T(⟦e⟧, c))
T(⟦f(e1, ..., en)⟧, c)               = Call(f, T(⟦e1⟧, c), ..., T(⟦en⟧, c))
T(⟦e1, e2⟧, c)                       = Concatenate(T(⟦e1⟧, c), T(⟦e2⟧, c))
T(⟦some $v in e1 satisfies e2⟧, c)   = T(⟦e1⟧, T(⟦e2⟧, c))
T(⟦every $v in e1 satisfies e2⟧, c)  = T(⟦not(some $v in e1 satisfies not(e2))⟧, c)
T(⟦FL where e1 order by e2 return e3⟧, c)
    = Assign(w, Step(),
             Sort(Return(Predicate(F(⟦FL⟧, c, Var(w)), Var(w),
                                   T(⟦e1⟧, Var(w))),
                         Var(w), T(⟦e3⟧, Var(w))),
                  Var(w), T(⟦e2⟧, Var(w))))

Translation of path expressions: path

P(⟦/A path⟧, c)    = Child(A, P(⟦path⟧, c))
P(⟦//A path⟧, c)   = Descendant(A, P(⟦path⟧, c))
P(⟦/* path⟧, c)    = Any(P(⟦path⟧, c))
P(⟦/@A path⟧, c)   = Attribute(A, P(⟦path⟧, c))
P(⟦[e] path⟧, c)   = Assign(w, Step(), Predicate(P(⟦path⟧, c), Var(w), T(⟦e⟧, Var(w))))
P(⟦e path⟧, c)     = T(⟦e⟧, P(⟦path⟧, c))
P(⟦⟧, c)           = c

Translation of FLWOR for/let bindings: FL

F(⟦for $v in e FL⟧, c, s)  = F(⟦FL⟧, T(⟦e⟧, c), s)   (first for-loop)
F(⟦for $v in e FL⟧, c, s)  = F(⟦FL⟧, Assign(w, Step(), Loop(c, Var(w), T(⟦e⟧, Var(w)))), s)
F(⟦let $v := e FL⟧, c, s)  = Assign(n, Clone(T(⟦e⟧, s), n), F(⟦FL⟧, c, s))   (clone n times)
F(⟦⟧, c, s)                = c

Translation of IR search predicates: S, S1, S2

R(⟦"phrase"⟧, c)     = Phrase(c, "phrase")
R(⟦S1 and S2⟧, c)    = Conjunction(R(⟦S1⟧, c), R(⟦S2⟧, c))
R(⟦not S⟧, c)        = Negation(R(⟦S⟧, c))

Fig. 1. Translation from XQuery to Stream Iterators (w are fresh variables)

The Assign/Var operators do not correspond to any iterator. Instead, they 
are used for local assignments: Assign binds a local variable to an iterator and 
Var returns the value of the local variable. 

As a simple example of translation, the following XQuery: 

document("cs.xml")/department/gradstudent[gpa/text() = "3.5"]/name

is translated into:

Child(name, Assign(0, Step(),
    Predicate(Child(gradstudent, Child(department, Document("cs.xml"))),
              Var(0),
              Assign(1, Var(0), Call(eq, Text(Child(gpa, Var(1))), Constant("3.5"))))))



6 Conclusion 

Our language extensions are very simple, yet powerful enough to capture many 
emerging proposals for document-centric retrieval of XML data with relevance 
ranking. The XQueryRank evaluation engine is made out of pipelines of stream 
iterators that avoid materialization of intermediate results when possible and 
use efficient merge joins to evaluate path expressions and search predicates. Its 
modular approach of building pipelines to evaluate XQuery scales up to any 
query complexity because the pipes can be connected in the same way complex 
queries are formed from simpler ones. 

The Java source of the prototype system is available at 

http://lambda.uta.edu/XQueryRank.tar.gz

Acknowledgments: This work is supported in part by the National Science 
Foundation under the grant IIS-0307460.

Abstract. We study the problem of storing XML documents using relational
mappings. We propose a formalization of classes of mapping
schemes based on the languages used for defining functions that assign
relational databases to XML documents and vice-versa. We also discuss
notions of information preservation for mapping schemes; we define lossless
mapping schemes as those that preserve the structure and content of
the documents, and validating mapping schemes as those in which valid
documents can be mapped into legal databases, and all legal databases
are (equivalent to) mappings of valid documents. We define one natural
class of mapping schemes that captures all mappings in the literature,
and show negative results for testing whether such mappings are lossless
or validating. Finally, we propose a lossless and validating mapping
scheme, and show that it performs well in the presence of updates.



1 Introduction 

The problem of storing XML documents using relational engines has attracted 
significant interest with a view to leveraging the powerful and reliable data 
management services provided by these engines. In a mapping-based XML stor- 
age system, the XML document is represented as a relational database and 
queries (including updates) over the document are translated into queries over 
the database. Thus, it is important that the translation of XML queries and up- 
dates into SQL transactions be correct. In particular, only updates resulting in 
valid documents should be allowed. To date, the focus of the work in designing 
mapping schemes for XML documents has been on ensuring that XML queries 
over the documents can be answered using their relational representations. In 
this paper, we study the problem of designing mapping schemes that ensure only 
valid update operations can be executed. 

An important requirement in mapping systems is that any query over the 
document must be computable from the database that represents the docu- 
ment. In particular, the mapping must be lossless, that is, the document itself 
must be recoverable from its relational image. When updates are considered, 
one must be able to test whether an operation results in the representation of 
a valid document before applying the operation to the database. This amounts 
to checking whether the resulting database represents a well-formed document
(i.e., a tree) and that this document conforms to a document schema. Several 
works [4,22] exploit information about a document schema for designing better 
mapping schemes; the metric in most of these is query performance, for example, 
the number of joins required for answering certain queries. However, to the best 
of our knowledge, there has been no work on ensuring that only valid updates 
can be processed over relational representations of documents. 

1.1 Related Work 

There is extensive literature on defining information preserving mappings for 
relational databases [17]; our notion of lossless mapping schemes is inspired by 
that of lossless decompositions of relations. The incremental checking of rela- 
tional view/integrity constraints has also been widely studied [15,12]. Some of 
the view maintenance techniques have been extended to semi-structured models
that were precursors of XML [23]. Another related problem, the incremental
validation of XML documents w.r.t. XML schema formalisms, has been studied 
in [3,5,19]; we use the incremental algorithms proposed in [3] in this work. 

The literature on mapping XML documents to relational databases is also 
vast [16]. One of the first proposals was the Edge [14] scheme, a generic approach 
that explicitly stores all the edges in a document tree. Departing from generic 
mappings, several specialized strategies have been proposed which make use of 
schema information about the documents to generate more efficient mappings. 
Shanmugasundaram et al. [22] describe three specialized strategies which aim to 
minimize data fragmentation by inlining, whenever possible, the content of cer- 
tain elements as columns in the relation that represents their parents. LegoDB [4] 
is a cost-based tool that generates relational mappings using inlining as well as 
other operations [21]; the goal there is to find a relational configuration with 
lowest cost for processing a given query workload on a given XML document. 
STORED [11] is a hybrid method for mapping semistructured data in which rela- 
tional tables are used for storing the most regular and frequent structures, while 
overflow graphs store the remaining portions of the semistructured database. 
None of these works deals with checking the validity of updates. 

Orthogonal to designing mapping approaches is the translation of constraints 
in the document schema to the relational schema for the mappings. For instance, 
the propagation of keys [9] and functional dependencies [8] have been studied. 
Besides capturing the semantics of the original document schema, these tech- 
niques have been shown to improve the mappings by, e.g., reducing the storage 
of redundant information [8]. However, these works do not consider the translation
of the constraints defined by the content models (i.e., regular expressions)
in the document schema, which is the focus of this work. We note that these 
techniques can be applied directly in our method. 

Designing relational mappings for XML documents can be viewed as the 
reincarnation of the well-known problem of “simulating” semantic models in 
relational schemas [1], and there are mapping schemes that follow this “classical” 
approach. However, the semantic mismatch between the XML and relational 
models is much more pronounced than in previously considered models. For
example, XML data is inherently ordered; moreover, preserving this ordering 
is crucial for deciding the validity of the elements in the document. Mani and 
Lee [18] define an extended (yet unordered) ER model for describing XML data, 
and provide algorithms for translating these models into relational schemas. 

For this study, we had to fix an update language; we settled for a simple 
language, as the goal was to verify the feasibility of our approach rather than 
proposing a new language. For a discussion on updating XML, see [24]. 



Contributions and Outline. In Section 3, we provide a formalization of map- 
ping schemes as pairs of functions, the mapping and publishing functions, that 
assign relational databases to XML documents and vice-versa. We introduce in 
Section 3.1 the notion of parameterized classes of mapping schemes, which are 
defined by the languages used for specifying the mapping function, the publishing 
function, and the relational constraints in the target schema. In Section 3.2 we 
introduce $\mathcal{XDS}$: a natural class of mapping schemes powerful enough to express
all, to the best of our knowledge, mapping schemes in the literature. In Section 4, 
we characterize mapping schemes with respect to information preservation. In 
particular, we define lossless mapping schemes as those that ensure all XML 
queries over a document can be executed over its corresponding database; and 
validating mapping schemes, which ensure only valid updates can be effected. 
We also show that testing both properties for $\mathcal{XDS}$ mapping schemes is undecidable.
In Section 5 we propose an $\mathcal{XDS}$ mapping scheme that is both lossless
and validating, and discuss the incremental maintenance of the constraints in 
such mappings in the presence of updates in Section 6. 

2 Preliminaries 

We model XML documents as ordered labeled trees with element and text nodes. 
The root of the tree represents the whole document and has one child, which 
represents the outermost element. Following the XML standard [27], the content 
of an element node e is the (possibly empty) list of element or text nodes that 
are children of e. For simplicity, in an element with mixed content (i.e., element 
and text nodes as its children), we replace each text node by an element node 
whose label is #PCDATA and whose only child is the given text node. Thus,
each element in the tree has either text or element nodes as its content; also, all 
text nodes are leaf nodes in the tree. 

More precisely, we define the following. Let $\mathcal{I}$, $\mathcal{V}$ be two disjoint, countably
infinite sets of node ids and values.

Definition 1 (XML Document) An XML document is a tuple $(T, \lambda, \tau, \nu)$,
such that $T$ is an ordered tree whose nodes are elements of $\mathcal{I}$, $\lambda : \mathcal{I} \to \mathcal{V}$ is an
assignment of labels to nodes in $T$, $\tau : \mathcal{I} \to \{\mathit{element}, \mathit{text}\}$ is an assignment of
node types to the nodes in $T$, and $\nu : \mathcal{I} \to \mathcal{V}$ is an assignment of values to text
nodes in $T$ (i.e., $\nu$ is undefined for a node $e$ if $\tau(e) \neq \mathit{text}$).
We model document schemas as sets of rules which constrain the structure
and the labels of the elements in the document. Such element declaration rules
assign content models to element types. An element type is defined by a path
expression of the form $[c_i]/l_i$ where $l_i$ is an element label and $c_i$ is an optional
path expression specifying the context in which $l_i$ occurs. We restrict contexts to
be path expressions matching $E := a \mid E/E \mid E//E$, where $a$ is either an element
label or the wildcard * for matching elements of any label. We say an element
is of type $t_i$ if it is in the result set of evaluating $[c_i]/l_i$ over the document.
Following the XML standard, a content model is specified by a 1-unambiguous
regular expression [6] of the form
$E := \epsilon \mid a \mid \texttt{\#PCDATA} \mid (E) \mid E\,\texttt{|}\,E \mid E,E \mid E^{*} \mid E^{+} \mid E?$,
where $\epsilon$ represents the empty string, $a$ is an element label, and
#PCDATA represents textual content. Note that this captures all four content
models for XML elements [27].

Definition 2 (Document Schema) A document schema $X$ is a set of element
declaration rules of the form $t_i \leftarrow r_i$ where $t_i$ is an element type and $r_i$ is a
content model, such that for any document $D$ that is valid with respect to $X$ (as
defined below), for each element $e$ in $D$, there is exactly one element type $t_i$ such
that $e$ is of type $t_i$.

Let $e$ be an element of type $t_i$, and $c_1, \ldots, c_n$ be its (ordered) children; we say
$e$ is valid with respect to rule $t_i \leftarrow r_i$ if the string $\lambda(c_1) \cdots \lambda(c_n)$ matches $r_i$. We
say a document $D$ is valid with respect to a document schema $X$, denoted $D \in L(X)$,
if all elements in $D$ are valid with respect to the rules in $X$ corresponding
to their respective types.
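The validity test above can be illustrated with a minimal Python sketch. It assumes a hypothetical schema with the rules a <- (b | #PCDATA)* and b <- #PCDATA, identifies element types by label only (no context paths), and hand-translates each content model into a Python regular expression over space-terminated labels. Python's `re` module is only a stand-in here: it does not enforce 1-unambiguity.

```python
import re

# Hypothetical schema X: a <- (b | #PCDATA)*, b <- #PCDATA.
# Content models are hand-translated to regexes over space-terminated labels.
RULES = {
    "a": r"((b|#PCDATA) )*",
    "b": r"#PCDATA ",
}

def child_label(node):
    """Text nodes (plain strings) count as #PCDATA; elements are (label, children)."""
    return "#PCDATA" if isinstance(node, str) else node[0]

def is_valid(node):
    """True iff every element's string of child labels matches its content model."""
    if isinstance(node, str):
        return True
    label, children = node
    rule = RULES.get(label)
    if rule is None:
        return False  # no element declaration rule for this label
    word = "".join(child_label(c) + " " for c in children)
    return re.fullmatch(rule, word) is not None and all(is_valid(c) for c in children)

doc = ("a", [("b", ["foo"]), "mixed", ("b", ["bar"])])
bad = ("a", [("b", [("c", [])])])  # b may only contain text
print(is_valid(doc), is_valid(bad))
```

The nested-tuple document representation is an assumption made for brevity; any ordered labeled tree would do.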

Let $\mathcal{J}$, $\mathcal{D}$ be two disjoint countably infinite sets of surrogates and constants.
We define a relational database as follows:

Definition 3 (Relational Database) Each relation scheme $R$ has a (possibly
empty) set of attributes of domain $\mathcal{J}$, called the surrogate attributes of $R$, and
a (possibly empty) set of attributes with domain $\mathcal{D}$. Everything else is defined as
customary for the relational model.

Intuitively, renaming node ids in a document does not yield a new document. 
Similarly, renaming surrogates in a database does not create a new mapping. In 
order to capture these properties, we define: 

Definition 4 (Document Equivalence) XML documents $D_1 = (T_1, \lambda_1, \tau_1, \nu_1)$
and $D_2 = (T_2, \lambda_2, \tau_2, \nu_2)$ are equivalent, denoted by $D_1 \equiv D_2$,
if there exists an isomorphism $\phi$ between $T_1$ and $T_2$ s.t. $\lambda_1(v) = \lambda_2(\phi(v))$,
$\tau_1(v) = \tau_2(\phi(v))$, and $\nu_1(v) = \nu_2(\phi(v))$, for all $v \in T_1$.

Definition 5 (Database Equivalence) Two database instances $I_1, I_2$ are
equivalent, written $I_1 \equiv I_2$, if there exists a bijection on $\mathcal{J} \cup \mathcal{D}$ that maps $\mathcal{J}$
to $\mathcal{J}$, is the identity on $\mathcal{D}$, and transforms $I_1$ into $I_2$.

(The notion of database equivalence discussed above has been called OID
equivalence in the context of object databases [1].) 
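Document equivalence can be sketched in Python as follows. The `Node` class and the "#text" label for text nodes are assumptions made for illustration. Because the trees are ordered, the only candidate isomorphism pairs children position by position, so a recursive comparison that ignores node ids suffices.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    node_id: int                   # renaming ids must not matter
    label: str
    kind: str                      # "element" or "text"
    value: Optional[str] = None    # defined for text nodes only
    children: List["Node"] = field(default_factory=list)

def equivalent(d1: Node, d2: Node) -> bool:
    """Ordered-tree isomorphism preserving labels, node types and values."""
    return (
        d1.label == d2.label
        and d1.kind == d2.kind
        and d1.value == d2.value
        and len(d1.children) == len(d2.children)
        and all(equivalent(a, b) for a, b in zip(d1.children, d2.children))
    )

d1 = Node(1, "b", "element", children=[Node(2, "#text", "text", "foo")])
d2 = Node(10, "b", "element", children=[Node(20, "#text", "text", "foo")])
d3 = Node(1, "b", "element", children=[Node(2, "#text", "text", "bar")])
print(equivalent(d1, d2), equivalent(d1, d3))
```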
3 XML-to-Relational Mappings

We define an XML-to-relational mapping scheme as a triple $\mu = (\sigma, \pi, S)$, where
$S$ is a relational schema; $\sigma$ is a mapping function that assigns databases to
XML documents; and $\pi$ is a publishing function that assigns XML documents
to databases. More precisely, let $\mathcal{X}$ be the set of all XML documents; we define:

Definition 6 (Mapping Scheme) A mapping scheme is a triple $\mu = (\sigma, \pi, S)$,
where $\sigma : \mathcal{X} \to \mathcal{R}(S)$ is a partial function, and $\pi : \mathcal{R}(S) \to \mathcal{X}$ is a total function.
Moreover:

1. for all $D_1, D_2$, $D_1 \equiv D_2$ implies $\sigma(D_1) \equiv \sigma(D_2)$,

2. for all $I_1, I_2$, $I_1 \equiv I_2$ implies $\pi(I_1) \equiv \pi(I_2)$.

Defining $\sigma$ to be a partial function allows mapping schemes where the relational
schemas are customized for documents conforming to a given document
schema $X$ [4,22]; in other words, mapping schemes where $\mathrm{Dom}(\sigma) = L(X)$. On
the other hand, requiring $\pi$ to be total ensures that any legal database (as defined
below) can be mapped into a document. Conditions 1 and 2 in Definition 6
ensure that both $\sigma$ and $\pi$ are generic (in the sense of database theory [1]): they
map equivalent documents to equivalent databases and vice-versa.



3.1 Parameterized Classes of Mapping Schemes 

We define classes of mapping schemes based on the languages used for specifying 
$\sigma$, $S$, and $\pi$. The power of these languages determines what kinds of mappings
can be specified. For example, some sort of counting mechanism in the mapping 
language is required for specifying mapping functions that encode “interval- 
based” element ordering [10]. 

3.2 The $\mathcal{XDS}$ Class of Mappings

We now describe one particular class of mapping schemes that captures all mappings
proposed in the literature. In summary, we use an XQuery-like language
for defining $\sigma$, we allow boolean datalog queries with inequality and stratified
negation to be specified in the relational schema, and we use XQuery coupled
with a standard publishing framework such as SilkRoute [13]. We will call this
class of mappings $\mathcal{XDS}$ (for XQuery, Datalog, and SilkRoute).

The Mapping Language. The language consists of XQuery augmented with a
clause sql ... end for specifying SQL insert statements, to be used instead
of the return clause in a FLWOR expression. The semantics of the mapping
expressions is defined similarly to the usual semantics of FLWOR expressions:
the for, let, where, and order by clauses define a list of tuples which are passed,
one at a time, to the sql ... end clause, and one SQL transaction is issued per
such tuple. Unlike a query, a mapping expression does not return any values,
and is declared within a procedure instead of an XQuery function.
The Relational Constraints. In $\mathcal{XDS}$ mapping schemes, each constraint in
the relational schema is a boolean query that must evaluate to false unless the 
constraint is violated. We say that a relational instance is legal if it does not 
violate any of the constraints. Each of these queries is expressed as a set of 
datalog rules, augmented with stratified negation and not-equals. This language 
allows easy expression of standard relational constraints such as functional de- 
pendencies and referential integrity, while the recursion in datalog can be used 
to express, for example, that the children of an element conform to the element’s 
content model, as shown in the Edge++ mapping of Section 5.

The Publishing Language. Publishing functions are arbitrary XQuery ex- 
pressions over a “canonical” XML view of a relational database. That is, each 
relation is mapped into an element whose children represent the tuples in that 
relation in the standard way (i.e., one element per column). This is the approach 
taken by SilkRoute [13] and XPERANTO [7]. Of course, there are several different
“canonical” views (i.e., documents) that represent the same database, as
no implicit ordering exists among tuples or relations in the database.

It is easy to see that fairly complex mapping schemes are possible in the
$\mathcal{XDS}$ class. In fact, all mapping schemes that we are aware of in the literature
are in $\mathcal{XDS}$. Below, we give an example of such a mapping scheme.

Example 1. The Edge mapping scheme [14] belongs to $\mathcal{XDS}$ and can be described
as follows. The relational schema $S$ contains the relations (primary keys
are underlined):



Edge(parent : $\mathcal{J}$, child : $\mathcal{J}$, ordinal : $\mathcal{D}$, label : $\mathcal{D}$), Value(element : $\mathcal{J}$, value : $\mathcal{D}$)



The Edge relation contains a tuple for each element-to-element edge in the 
document tree, consisting of the ids of the parent and child, the child’s ordinal, 
and the child’s label. The Value relation contains a tuple for each leaf in the tree. 
The relational schema also contains constraints for ensuring that: there is only 
one root element; every child has a parent; every element id in Value appears 
also in Edge; and that the ordinals of nodes in the tree are consistent. Note these 
constraints are easily expressed as boolean datalog programs. 
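As an illustration only, the four structural constraints listed above can be checked over in-memory copies of the Edge and Value relations with the Python sketch below. The function name is hypothetical, and the reading of "consistent ordinals" (the ordinals under each parent are exactly 0, 1, ..., k-1) and of "every child has a parent" (every parent id other than the document node 0 is itself stored as a child) are assumptions, since the text does not spell these constraints out.

```python
def violated_constraints(edge, value):
    """Names of violated constraints; an empty list means the instance is legal."""
    child_ids = {child for _, child, _, _ in edge}
    violated = []
    # Exactly one root element, i.e. one child of the document node 0.
    if sum(1 for parent, _, _, _ in edge if parent == 0) != 1:
        violated.append("single root")
    # Every parent other than the document node is itself stored as a child.
    if any(parent != 0 and parent not in child_ids for parent, _, _, _ in edge):
        violated.append("every child has a parent")
    # Every element id in Value also appears in Edge.
    if any(element not in child_ids for element, _ in value):
        violated.append("Value ids appear in Edge")
    # Ordinals under each parent are 0, 1, ..., k-1 (assumed meaning of "consistent").
    ordinals = {}
    for parent, _, ordinal, _ in edge:
        ordinals.setdefault(parent, []).append(ordinal)
    if any(sorted(o) != list(range(len(o))) for o in ordinals.values()):
        violated.append("consistent ordinals")
    return violated

EDGE = [(0, 1, 0, "a"), (1, 2, 0, "b"), (1, 3, 1, "#PCDATA"), (1, 4, 2, "b")]
VALUE = [(2, "foo"), (3, "mixed"), (4, "bar")]
print(violated_constraints(EDGE, VALUE))
print(violated_constraints(EDGE, VALUE + [(9, "orphan")]))
```

In the scheme itself these checks are boolean datalog programs evaluated by the database; the Python version only mirrors their intent.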

The mapping function $\sigma$ is defined by a recursive procedure that visits the
children of a given node, storing the element nodes in the Edge relation and the
text nodes in the Value relation:

define procedure map_node($e as node) {
  for $n at $i in $e/*
    if ($n instance of element()) then
      sql INSERT INTO Edge VALUES (id($e), id($n), $i, name($n)) end;
      map_node($n);
    if ($n instance of text()) then
      sql INSERT INTO Value VALUES (id($e), $n) end;
}

(: to map the document :)
map_node(doc("doc.xml"));
<a>
  <b>foo</b>
  mixed
  <b>bar</b>
</a>

(a) Document

Edge:
parent  child  ordinal  label
0       1      0        a
1       2      0        b
1       3      1        #PCDATA
1       4      2        b

Value:
element  value
2        foo
3        mixed
4        bar

(b) Edge mapping

Fig. 1. Edge mapping of a document.



The id() function used in the procedure above can be any function that 
assigns a unique value to each node in the tree; for concreteness, we assume the 
function returns the node’s discovery time in a DFS traversal of the tree, and 
that the discovery time of the document node (i.e., the node that is the parent 
of the root element in the document) is 0. Figure 1 shows a document and its 
image under the Edge mapping. 
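The Edge mapping can be mimicked outside the database with the following Python sketch, which is not the mapping procedure of the paper but an approximation under stated assumptions: it uses the standard `xml.etree.ElementTree` parser, strips surrounding whitespace from text, wraps text in mixed content into #PCDATA pseudo-elements as described in Section 2, numbers elements by DFS discovery time starting from the document node 0, and uses 0-based ordinals so that the output agrees with Figure 1.

```python
import xml.etree.ElementTree as ET

def ordered_children(elem):
    """Children in document order; in mixed content each text run becomes
    a ("#PCDATA", text) pseudo-element."""
    kids = list(elem)
    if not kids:
        return []
    out = []
    if elem.text and elem.text.strip():
        out.append(("#PCDATA", elem.text.strip()))
    for kid in kids:
        out.append((kid.tag, kid))
        if kid.tail and kid.tail.strip():
            out.append(("#PCDATA", kid.tail.strip()))
    return out

def edge_mapping(xml_text):
    """Return the (Edge, Value) tuple lists for a document given as a string."""
    edge, value = [], []
    next_id = 0  # id 0 is reserved for the document node

    def visit(parent_id, label, node, ordinal):
        nonlocal next_id
        next_id += 1
        my_id = next_id
        edge.append((parent_id, my_id, ordinal, label))
        if isinstance(node, str):            # #PCDATA pseudo-element
            value.append((my_id, node))
            return
        kids = ordered_children(node)
        if not kids and node.text and node.text.strip():
            value.append((my_id, node.text.strip()))   # text-only element
        for i, (kid_label, kid) in enumerate(kids):
            visit(my_id, kid_label, kid, i)

    root = ET.fromstring(xml_text)
    visit(0, root.tag, root, 0)
    return edge, value

edge, value = edge_mapping("<a><b>foo</b>mixed<b>bar</b></a>")
print(edge)
print(value)
```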

Finally, the publishing function $\pi$ is defined as follows. The publishing of an
element $e$ is done by finding all its children in the Edge and Value elements
in the “canonical” published views, and returning them in the same order as
their ordinal values. We use the order by clause in the FLWOR expression to
reconstruct the sequence of element and text nodes in the same order as in the
original document.
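A rough Python counterpart of this publishing step is sketched below. It works directly on the Edge and Value tuples of Figure 1 rather than on a canonical XML view, sorts children by ordinal in place of the order by clause, and does no escaping of XML special characters. The function name `publish` is an assumption.

```python
def publish(edge, value):
    """Rebuild the document text from Edge and Value tuples."""
    children = {}
    for parent, child, ordinal, label in edge:
        children.setdefault(parent, []).append((ordinal, child, label))
    text = dict(value)

    def render(node_id, label):
        if label == "#PCDATA":               # pseudo-element: emit its text only
            return text[node_id]
        body = text.get(node_id, "")         # text-only element content, if any
        for _, child, child_label in sorted(children.get(node_id, [])):
            body += render(child, child_label)
        return f"<{label}>{body}</{label}>"

    (_, root_id, root_label), = children[0]  # exactly one root element
    return render(root_id, root_label)

EDGE = [(0, 1, 0, "a"), (1, 2, 0, "b"), (1, 3, 1, "#PCDATA"), (1, 4, 2, "b")]
VALUE = [(2, "foo"), (3, "mixed"), (4, "bar")]
print(publish(EDGE, VALUE))
```

Applied to the image of the document in Figure 1(a), the sketch returns that document (up to whitespace).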



4 Preserving Document Information 

As discussed in Section 1, a mapping-based storage system for an XML document 
should correctly translate any query over the document, including updates, into 
queries over the mapped data. In this section, we discuss two properties that 
ensure that queries and valid updates can always be processed over databases 
that represent documents. 

The generally assumed notion of information preservation for a mapping 
scheme is that it must be lossless; that is, one must be able to reconstruct 
any (fragment of a) document from its relational image [11]. More precisely, we 
define: 

Definition 7 (Lossless Mapping Scheme) A mapping scheme $\mu = (\sigma, \pi, S)$
is lossless if for all $D \in \mathrm{Dom}(\sigma)$, $\pi(\sigma(D)) \equiv D$.

Informally, losslessness ensures that all queries over the documents can be 
answered using their mapped images: besides the naive approach of materializing 
$\pi(\sigma(D))$ and processing the query, there are several techniques for translating
the XML queries into SQL for specific mappings and XML query languages [16]. 
The following is easy to verify: 
Fig. 2. Updating documents and their mapping images.



Proposition 1 The Edge mapping scheme is lossless. 

While losslessness is necessary for document reconstruction, it does not ensure
that databases represent valid documents, nor that updates to representations
of valid documents result in databases that also represent valid documents.
As an example, consider a mapping scheme for documents conforming
to the schema $X$: $a \leftarrow (b \mid \texttt{\#PCDATA})^{*}$; $b \leftarrow \texttt{\#PCDATA}$, and recall the (lossless)
Edge mapping scheme discussed in Example 1. Figure 1(a) shows a valid
document with respect to $X$, and its relational representation is shown in
Figure 1(b). It is easy to see that any permissible update (i.e., an update resulting
in a valid document) to the document can be translated into an equivalent SQL 
transaction over the document’s representation. However, it is also possible to 
update the representation in a way that results in a legal database that does 
not represent a valid document with respect to X. For instance, the SQL trans- 
action that inserts the tuple (3,4, 0,c) into Edge and (4,0,foo) into Value, which 
results in a legal database, has no equivalent permissible XML update because it 
corresponds to inserting an element labeled c as a child of the second b element, 
which violates the document schema. 

We define a property of mapping schemes that ensures that all and only valid 
documents can be represented: 

Definition 8 (Validating Mapping Scheme) $\mu = (\sigma, \pi, S)$ is validating
with respect to document schema $X$ if $\sigma$ is total on $L(X)$, and for all $I \in \mathcal{R}(S)$,
there exists $D \in L(X)$ such that $I \equiv \sigma(D)$.

For simplicity, we drop the reference to the document schema X when dis- 
cussing a validating mapping scheme, and just say that a document is valid if 
it is valid with respect to X. Intuitively, a mapping scheme is validating with
respect to X if it maps all valid documents to some legal database and if every 
legal database is (equivalent to) the image of some valid document under the 
mapping. This implies that all permissible updates to documents can be trans- 
lated into equivalent updates over their mappings, and vice-versa, as depicted 
in the diagram in Figure 2. 

We conclude this section by showing that there are theoretical impediments
to the goal of automatically designing lossless and/or validating $\mathcal{XDS}$ mappings.
The following results are direct consequences of the interactions of context-free
languages and chain datalog programs¹ [25,26]. The proofs are omitted in the
interest of space and are given in the full version of the paper [2].

Theorem 1 Testing losslessness of $\mathcal{XDS}$ mapping schemes is undecidable.
Theorem 2 Testing validation of $\mathcal{XDS}$ mapping schemes is undecidable.

5 A Lossless and Validating $\mathcal{XDS}$ Mapping Scheme

In this section we introduce Edge++: a lossless and validating mapping scheme in
$\mathcal{XDS}$. The Edge++ mapping scheme is an extension of the Edge mapping scheme
that includes constraints for ensuring the validation property. We
fully describe the procedure for creating a mapping scheme $\mu = (\sigma, \pi, S)$ given
a document schema $X$, and show that $\mu$ is both lossless and validating with
respect to $X$.

5.1 The Relational Schema 

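Recall the definition of relational databases in Section 2. Let $\mathcal{J}' = \mathcal{J} \cup \{\#\}$, $\# \notin \mathcal{J}$
(the symbol # will be used for marking elements that have no children); also, let
$\mathcal{Q} \subset \mathcal{D}$ be a set of states, $\mathcal{T} \subset \mathcal{D}$ be a set of type identifiers, and $\mathcal{B} \subset \mathcal{D}$ denote
the set of boolean constants. The relational schema of Edge++ is as follows:

Edge(parent : $\mathcal{J}$, child : $\mathcal{J}$, label : $\mathcal{D}$), FLC(parent : $\mathcal{J}$, first : $\mathcal{J}'$, last : $\mathcal{J}'$),

ILS(left : $\mathcal{J}$, right : $\mathcal{J}$), Value(element : $\mathcal{J}$, value : $\mathcal{D}$), Type(element : $\mathcal{J}$, type : $\mathcal{T}$),
Transition(type : $\mathcal{T}$, from : $\mathcal{Q}$, symbol : $\mathcal{D}$, to : $\mathcal{Q}$, isAccepting : $\mathcal{B}$)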

The Edge and Value relations are used essentially as in the Edge mapping 
scheme, except that the ordering of the element nodes is not explicitly stored. 
Instead, we use the FLC and ILS relations to represent the successor relation 
among element nodes that are children of the same parent element (ILS stands 
for “immediate-left-sibling” and FLC stands for “first-last-children”). The choice
of keeping the ordering of the elements using the FLC and ILS relations is mo- 
tivated by the fact that they allow faster updates to the databases, as we do 
not need to increase (resp. decrease) the ordinals of potentially all children of an 
element after an insertion (resp. deletion). The constraints that we discuss here 
require only that we can access the next sibling of any element, and, of course, 
can be adapted to other ordering schemes (e.g., interval-based). 

More precisely, the FLC relation contains a tuple for each element $e$ in the
document whose content model is not #PCDATA, consisting of the id of $e$ and
the ids of its first and last children; if $e$ has no content (i.e., no children), a
tuple $(s_e, \#, \#)$ is stored in FLC, where $s_e$ is the surrogate for $e$'s id. The ILS
relation contains a tuple consisting of the ids of consecutive element nodes that
are children of the same parent. Thus, if $c_1, \ldots, c_n$ are the ordered children of
an element $e$, we have $(s_e, s_{c_1}, s_{c_n}) \in$ FLC, and $\{(s_{c_1}, s_{c_2}), \ldots, (s_{c_{n-1}}, s_{c_n})\} \subseteq$
ILS. The Type relation contains the types of each element in the document.
Finally, the Transition relation stores the transition functions of the automata
that correspond to the content models in the document schema.

¹ Chain datalog programs seek pairs of nodes $x$ and $y$ in a graph such that there exists
a path from $x$ to $y$ whose labels spell a word in an associated language.
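The FLC/ILS encoding of sibling order can be illustrated with a short Python sketch. The helper names are hypothetical, and surrogates are plain integers with the string "#" standing for the no-children marker.

```python
def encode_children(parent, kids):
    """FLC and ILS tuples for the ordered child ids `kids` of `parent`."""
    if not kids:
        return [(parent, "#", "#")], []
    flc = [(parent, kids[0], kids[-1])]
    ils = list(zip(kids, kids[1:]))      # (left sibling, right sibling) pairs
    return flc, ils

def decode_children(parent, flc, ils):
    """Recover the ordered children by following ILS from the first child."""
    first, last = next((f, l) for p, f, l in flc if p == parent)
    if first == "#":
        return []
    successor = dict(ils)                # left sibling -> right sibling
    order = [first]
    while order[-1] != last:
        order.append(successor[order[-1]])
    return order

flc, ils = encode_children(1, [2, 3, 4])
print(flc, ils, decode_children(1, flc, ils))
```

Inserting or deleting one child changes a constant number of FLC and ILS tuples, whereas an ordinal-based encoding may have to renumber every later sibling.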

Next, we discuss the two kinds of constraints defined in the relational schema. 

Structural Constraints. Similarly to the Edge mapping scheme, the structural 
constraints ensure that any legal database encodes a well-formed XML document 
(i.e., an ordered labeled tree). We also need to ensure that the ordering of the 
element nodes is consistent, and that each element has a type. We note that 
these constraints are easily encoded as boolean datalog programs. For instance, 
the following constraint ensures that no element that is marked as having no 
children is the parent of any other element: 

invalid :- FLC(p, #, #), Edge(p, _, _)

Validation Constraints. The validation constraints encode the rules in the
document schema into equivalent constraints over the relations in $S$, to ensure
that the document represented by the mapping is indeed valid. For each rule
$t_i \leftarrow r_i$ in the document schema, we define the following datalog program:

reach_{t_i}(p, #, s) :- FLC(p, #, #), Type(p, t_i), Transition(t_i, q_0, ε, s, _)                  (1)

reach_{t_i}(p, c, s) :- Edge(p, c, x), FLC(p, c, _), Type(p, t_i), Transition(t_i, q_0, x, s, _)   (2)

reach_{t_i}(p, c, s) :- reach_{t_i}(p, x, y), ILS(x, c), Type(p, t_i), Edge(p, c, w),
                        Transition(t_i, y, w, s, _)                                                (3)

accept_{t_i}(p) :- reach_{t_i}(p, c, s), FLC(p, _, c), Type(p, t_i), Transition(t_i, _, _, s, true)
invalid_{t_i} :- FLC(p, _, _), ¬accept_{t_i}(p)

The boolean view $\mathit{invalid}_{t_i}$ evaluates to true iff there is some element $p$ of
type $t_i$ whose contents are invalid with respect to $r_i$. The recursive view $\mathit{reach}_{t_i}$
simulates the automaton using the labels of the children of each element $p$ as
follows. The simulation starts with rules (1) or (2); rule (1) applies only if the
element has no content (the constant $q_0$ denotes the starting state of the automaton
for $r_i$). Rule (3) carries out the recursion over the children of $p$. Finally, the
$\mathit{accept}_{t_i}$ rule checks that the computation is accepting, that is, that the state $s$
reached after inspecting the last child $c$ (note that $c$ could be # if the element has
no content) is accepting.
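The automaton simulation carried out by the reach and accept views can be mirrored procedurally, as in the Python sketch below. It assumes a deterministic automaton (which 1-unambiguous content models allow), in-memory copies of the relations, the string "eps" for the empty-string symbol, and a hypothetical type identifier "t_a" for the rule a <- (b | #PCDATA)*; it is an illustration, not the datalog evaluation performed by the database.

```python
def accepts(p, edge, flc, ils, type_of, transition, q0="q0", eps="eps"):
    """Evaluate accept(p): run the automaton of p's type over p's child labels."""
    t = type_of[p]
    # Transition(type, from, symbol, to, isAccepting); isAccepting refers to `to`.
    delta = {(ty, frm, sym): (to, acc) for ty, frm, sym, to, acc in transition}
    label = {c: lab for parent, c, lab in edge if parent == p}
    first, last = next((f, l) for parent, f, l in flc if parent == p)
    successor = dict(ils)
    if first == "#":                                   # rule (1): no content
        step = delta.get((t, q0, eps))
        return step is not None and step[1]
    step = delta.get((t, q0, label[first]))            # rule (2): first child
    child = first
    while step is not None and child != last:          # rule (3): next sibling
        child = successor[child]
        step = delta.get((t, step[0], label[child]))
    return step is not None and step[1]                # accept: state after last child

EDGE = [(0, 1, "a"), (1, 2, "b"), (1, 3, "#PCDATA"), (1, 4, "b")]
FLC = [(1, 2, 4)]
ILS = [(2, 3), (3, 4)]
TYPE = {1: "t_a"}
TRANSITION = [
    ("t_a", "q0", "b", "q0", True),
    ("t_a", "q0", "#PCDATA", "q0", True),
    ("t_a", "q0", "eps", "q0", True),
]
BAD_EDGE = [(0, 1, "a"), (1, 2, "b"), (1, 3, "#PCDATA"), (1, 4, "c")]
print(accepts(1, EDGE, FLC, ILS, TYPE, TRANSITION))
print(accepts(1, BAD_EDGE, FLC, ILS, TYPE, TRANSITION))
```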

5.2 The Mapping Function 

As in Edge, we define a recursive function that maps all nodes in the document 
tree. Besides that, the mapping procedure in Edge++ also assigns types to the 
elements in the document. Of course, the types assigned to elements must match 
those stored in the Transition relation; for concreteness, we assume that type $t_i$
is represented by the integer $i$. The mapping of the document is then defined by
the procedure shown in Figure 3.
define procedure map_node($e as node) {
  let $cc := $e/*
  for $n in $cc
    if ($n instance of text()) then
      sql INSERT INTO Value VALUES (id($e), data($e)) end;
    else
      if ($n instance of element()) then
        if count($cc) = 0 then (: the element has no children :)
          sql INSERT INTO FLC VALUES (id($e), '#', '#') end;
        else
          let $first := $cc[1], $last := $cc[count($cc)]
          for $c at $i in $cc
            sql INSERT INTO Edge VALUES (id($e), id($c), name($c));
                INSERT INTO FLC VALUES (id($e), id($first), id($last));
            end;
            if ($i > 1) then
              let $j := $cc[$i - 1];
              (: ILS(left, right): $j is the immediate left sibling of $c :)
              sql INSERT INTO ILS VALUES (id($j), id($c)) end;
            else ()
        map_node($n);
}

map_node(doc("doc.xml"));

(: the following assign the types to elements :)
for $e in $d/t_1 sql INSERT INTO Type VALUES (id($e), 1) end;
...
for $e in $d/t_n sql INSERT INTO Type VALUES (id($e), n) end;



Fig. 3. Mapping function for Edge++.



5.3 The Publishing Function 

The publishing function $\pi$ in Edge++ is straightforward. Note that publishing a
single element e of type x is done by visiting all elements in the published view 
that have e as parent. In order to retrieve the elements in their original order, we 
use a recursive function that returns the next sibling according to the successor 
relation stored in ILS. In order to publish a subtree one can recursively apply 
the simple method above. 

The following is easy to verify: 

Proposition 2 Let $X$ be a document schema, then the Edge++ mapping scheme
$\mu = (\sigma, \pi, S)$ is lossless and validating with respect to $X$.

6 Updates and Incremental Validation 

Any update (i.e., SQL transaction) over an Edge++ mapping succeeds only if
the resulting database satisfies the constraints in the schema (i.e., represents a
valid document). Evidently, testing all constraints after each update is inefficient
and, in most cases, unnecessary [15]. In this section, we discuss efficient ways of
checking the validity of the Edge++ constraints in the presence of updates. We
start by briefly discussing a simple update language for XML documents which
can be effectively translated into SQL transactions over Edge++ mappings.
6.1 The Update Language

The update language provided to the user has a significant impact on the per- 
formance of a mapping-based storage system. In particular, there may be con- 
straints in the relational schema that can be dropped if the user is not allowed 
to issue arbitrary SQL updates over the mapped data. 

Consider the constraint in the Edge mapping scheme discussed in Example 1 
that specifies that “every element that appears in the Value relation must also 
appear in the Edge relation”. This constraint ensures that every #PCDATA
value stored in the database is the content of some element in the document. 
Note that this constraint is necessary only if the user can write arbitrary SQL 
update statements that modify the Value relation, but can be dropped if inserting 
elements with textual content is always done by a transaction that inserts tuples 
in Edge and Value relations at the same time. 

We note that it is reasonable to assume the user updates are issued in some 
XML update language, and these are translated into SQL transactions in a way 
that the “structural” constraints are preserved, so they need not be checked after 
each transaction. 

Update Operations. To date, there is no standard update language for XML, 
and proposing a proper update language is outside the goals of this work; instead, 
we use a minimal set of atomic operations, consisting basically of insertions and 
deletions of subtrees. 

- Append(p, y), where both p and y are elements, results in inserting y as the
  last child of p;

- InsertBefore(x, y), where both x and y are elements, results in inserting y
  as the immediate left sibling of x; this operation is not defined if x is the
  root of the document being updated;

- Delete(x), where x is an element, results in deleting x from the document.

For the Edge++ mapping scheme, it is straightforward to translate the primitive
operations above into SQL transactions in a way that preserves structural
validity.
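To make the translation concrete, the following Python sketch applies the three operations to in-memory stand-ins for the Edge, FLC and ILS relations. The class and method names are hypothetical, and the sketch is restricted to single elements without children (inserted elements are empty, deleted elements are leaves); subtree updates would repeat these steps per node. The Value, Type and Transition relations are left out.

```python
class EdgePlusPlus:
    """In-memory stand-in for Edge, FLC and ILS; updates are limited to leaf elements."""

    def __init__(self, root):
        self.edge = set()                 # (parent, child, label)
        self.ils = set()                  # (left sibling, right sibling)
        self.flc = {root: ("#", "#")}     # parent -> (first child, last child)

    def _parent(self, x):
        return next((p, lab) for p, c, lab in self.edge if c == x)

    def append(self, p, y, label):
        first, last = self.flc[p]
        self.edge.add((p, y, label))
        if last == "#":                   # p had no children
            self.flc[p] = (y, y)
        else:
            self.ils.add((last, y))
            self.flc[p] = (first, y)
        self.flc[y] = ("#", "#")

    def insert_before(self, x, y, label):
        p, _ = self._parent(x)            # fails if x is the root, as in the text
        _, last = self.flc[p]
        left = next((l for l, r in self.ils if r == x), None)
        self.edge.add((p, y, label))
        if left is None:                  # x was the first child
            self.flc[p] = (y, last)
        else:
            self.ils.remove((left, x))
            self.ils.add((left, y))
        self.ils.add((y, x))
        self.flc[y] = ("#", "#")

    def delete(self, x):
        p, label = self._parent(x)
        first, last = self.flc[p]
        left = next((l for l, r in self.ils if r == x), None)
        right = next((r for l, r in self.ils if l == x), None)
        self.edge.remove((p, x, label))
        self.ils -= {(left, x), (x, right)}
        if left is not None and right is not None:
            self.ils.add((left, right))
        if first == x and last == x:      # x was the only child
            self.flc[p] = ("#", "#")
        else:
            self.flc[p] = (right if first == x else first,
                           left if last == x else last)
        del self.flc[x]

    def children(self, p):
        first, last = self.flc[p]
        if first == "#":
            return []
        successor = dict(self.ils)
        order = [first]
        while order[-1] != last:
            order.append(successor[order[-1]])
        return order

db = EdgePlusPlus(root=1)
db.append(1, 2, "b")
db.append(1, 4, "b")
db.insert_before(4, 3, "#PCDATA")
print(db.children(1))
```

Each operation changes a constant number of tuples; the linear scans used here to find a parent or sibling would be index lookups in a relational engine.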

6.2 Incremental Checking of Validating Constraints 

The validating constraints in Edge++ mappings are recursive datalog programs
that test membership in regular languages, which is a problem that has been 
shown to have low complexity. Patnaik and Immerman [20] show that member- 
ship in regular languages can be incrementally tested for insertion, deletion or 
single-symbol renaming in logarithmic time. By viewing the sequence of children 
of an element as a string generated by the regular expression for that element, 
Barbosa et al. [3] give a constant time incremental algorithm for matching strings 
to certain 1-unambiguous regular expressions. The classes of regular expressions 
considered in that work capture those most commonly used in real-life document 
schemas.



Fig. 4. Updates and incremental validation times. 



Essentially, the idea behind the algorithms in [20,3] is to store the paths 
through the automata (i.e., sequence of states) together with the strings. Test- 
ing whether an update to the string is valid amounts to testing whether the 
corresponding path can be updated accordingly, while yielding another accept- 
ing path. As shown in [3], this test can be easily implemented as an efficient 
datalog program. 
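The idea of keeping the automaton path next to the string can be sketched as follows. This is a simplified illustration for a deterministic automaton given as a transition dictionary, not the algorithm of [20] or [3]: after an insertion it rescans only until the new path rejoins the stored one, so it is not constant-time in general. All names are hypothetical.

```python
def run(delta, start, symbols):
    """Return the list of states visited while reading symbols, or None if stuck."""
    path = [start]
    for sym in symbols:
        nxt = delta.get((path[-1], sym))
        if nxt is None:
            return None
        path.append(nxt)
    return path

def insert_symbol(delta, accepting, symbols, path, pos, sym):
    """Insert sym before position pos; return (new_symbols, new_path) or None.

    path is the stored state sequence for symbols (len(symbols) + 1 states).
    """
    q = delta.get((path[pos], sym))
    if q is None:
        return None
    new_path = path[:pos + 1] + [q]
    for k in range(pos, len(symbols)):
        if new_path[-1] == path[k]:
            # The new path has rejoined the stored one: reuse the rest unchanged.
            new_path += path[k + 1:]
            break
        nxt = delta.get((new_path[-1], symbols[k]))
        if nxt is None:
            return None
        new_path.append(nxt)
    if new_path[-1] not in accepting:
        return None
    return symbols[:pos] + [sym] + symbols[pos:], new_path
```

For the content model `a b* c`, inserting another `b` rejoins the stored path immediately, which is the situation the restricted expression classes are designed to guarantee.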



7 Experimental Evaluation 

We show preliminary experimental results for the incremental maintenance of 
the validation constraints as defined in Section 5.1. Our experiments were run 
on a Pentium-4 2.5 GHz machine running DB2 V8.1 on Linux. We used several 
XMark documents of varying sizes (512KB, 4MB, 32MB, 256MB and 2GB); we 
note that the regular expressions used in the XMark DTD follow the syntactic 
restrictions discussed in the previous section, and, thus, are amenable to the 
simple algorithm discussed. 

The implementation used in our experiments uses a more efficient relational 
schema than the one discussed in Section 5, which consists of horizontally parti- 
tioning the Edge relation based on the type of the parent element in each edge. 
That is, for each element type t, we define Edge_t(p, c, l) :- Edge(p, c, l), Type(p, t);
the Transition table is partitioned in a similar fashion. Note that this results in 
eliminating some of the joins in the validating constraint. The workload in our 
experiments consists of 100 insertions and deletions of items for auctions in the 
North America region, each performed as a separate transaction. Each element 
inserted is valid and consists of an entire subtree of size comparable to those 
already in the document, and each delete operation removes one of the items 
inserted. 
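The horizontal partitioning described above can be pictured with plain tuples. This is a sketch only: the relations are modelled as Python lists and dictionaries, the third component of an edge is carried along unchanged, and `partition_edges` is a hypothetical name.

```python
from collections import defaultdict

def partition_edges(edge, type_of):
    """Split Edge(p, c, l) tuples into one list per type of the parent element.

    edge    -- iterable of (parent_id, child_id, l) tuples
    type_of -- dict mapping an element id to its element type (the Type relation)
    """
    parts = defaultdict(list)
    for p, c, l in edge:
        parts[type_of[p]].append((p, c, l))
    return dict(parts)
```

A constraint that only concerns children of one element type can then read a single partition instead of joining the full Edge relation with Type.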

Figure 4 shows the times for executing the insertions and deletions, and for
incrementally recomputing (and updating) the validation constraints. The graph
shows that, in practice, Edge++ can achieve good performance: per-update costs
are dominated by SQL insert and delete operations, and the costs scale well with
document size, which is consistent with the results in [3]. It is worthy of note
that, in this experiment, all transactions were executed at the “user level”, thus 
incurring overheads (e.g., compiling and optimizing the queries for incremental 
maintenance of the validating constraints) that can be avoided in a native (inside 
the database engine) implementation. Even with these overheads, the cost for 
maintaining the validating constraints is roughly 10 times smaller than the cost 
of performing the actual update operations. 



8 Conclusion 

In this paper we have proposed a simple formal model for XML-to-relational 
mapping schemes. Our framework is based on classes of mapping schemes de- 
fined by the languages used for mapping the documents, specifying relational 
constraints, and publishing the databases back as XML documents. We introduced
a class of mappings called XDS, which captures all mapping schemes in
the literature. We proposed two natural notions of information preservation for
mapping schemes, which ensure that queries and valid updates over the documents
can be executed using these mappings. We showed that testing either
property for XDS mappings is undecidable. Finally, we have proposed a lossless
and validating XDS mapping scheme, and shown, through preliminary experimental
evaluation, that it performs well in practice.

We are currently working on designing information preserving transforma- 
tions to derive lossless and validating mapping schemes from other such map- 
pings. We have observed that virtually all mapping transformations proposed 
in the literature (e.g., inlining) can be modified to preserve the losslessness and 
validation properties in a simple way. We also hope that our simple formalization 
can be used by other researchers for studying other classes of mapping schemes, 
using other languages and possibly other notions of document (or database) 
equivalence.

Abstract. We study the problem of finding relevant relationships among user-defined
nodes of XML documents. We define a language that determines the nodes
as results of XPath expressions. The expressions are structured in a conjunctive 
normal form and the relationships among nodes qualifying in different conjuncts 
are determined as tree twigs of the searched XML documents. The query exe- 
cution is supported by an auxiliary index structure called the tree signature. We 
have implemented a prototype system that supports this kind of searching and we 
have conducted numerous experiments on XML data collections. We have found 
the query execution very efficient, thus suitable for on-line processing. We also 
demonstrate the superiority of our system with respect to a previous, rather re- 
stricted, approach of finding the lowest common ancestor of pairs of XML nodes. 

1 Introduction 

A typical characteristic of XML objects is that they combine in a single unit data values 
and their structure. Such encapsulation of data and structure is very convenient for 
exchanging data, because the separation of schema and data instances in traditional 
databases might cause problems in keeping proper relationships between these two 
parts. 

XML is becoming the preferable format for the representation of heterogeneous 
information in many and diverse application sectors, such as electronic data interchange, 
multimedia information systems, publishing and broadcasting, public administration, 
health care and medical applications, and information outside the corporate database. 
This widespread use of XML has posed a significant number of technical requirements 
for storage and content-based retrieval of XML data - many of them are still waiting for 
effective solutions. In particular, retrieval of XML data, based on content and structure, 
has been widely studied and the problem has been formalized by the definition of query 
languages such as XPath and XQuery. Consequently, most of the implementation
effort has concentrated on the development of systems able to execute queries expressed
in these languages. To tackle the problem of structural relationships, the main stream of
research on XML query processing has concentrated on developing indexing algorithms
that respect the structure. Though other alternatives exist, the structural or containment
join algorithms, such as [ZND+01], [LM01], [SAJ+02], [CVZ+02], and [BKS02], are
most popular. They all take advantage of the interval-based tree numbering scheme.

* This work was partially supported by the ECD project (Extended Content Delivery), funded by
the Italian government, by the VICE project (Virtual Communities for Education), also funded
by the Italian government, and by DELOS NoE, funded by the European Commission under
FP6 (Sixth Framework Programme).

However, many other research issues are still open. A new significant problem is 
that of searching for relationships among XML components, that is either nodes and/or 
values. Indeed, there are many cases where the user may have a vague idea of the XML 
structure, either because it is unknown, or because it is too complex, or because many 
different, semantically close or equivalent, structure forms are used for XML coding. In 
these cases, what the user may need to search for are the relationships that exist among 
the specified components. For instance, in an XML encoded bibliography dataset, one 
may want to search for relationships between two specific persons to discover whether 
they were co-authors, editors, editor and co-author, cited in the bibliography, etc. In all 
these cases, the user may obviously have problems with languages that require the search
paths to be specified as precisely as possible. Any vagueness may result in a significant
imprecision of search results and certainly in an undesirable increase of the computa- 
tional costs. For example, provided that the schema is known, a very complex XQuery - 
taking into account all possible combinations of person roles - or several queries should 
be expressed in order to obtain the same result, with an obvious performance drawback. 
In Section 4.3 we give a real example on a car insurance policy application. This will 
show the suitability of our approach both in terms of the expressiveness of the query 
language and the performance efficiency. 

In fact, there have been attempts to base search strategies on explicitly unknown 
structure of data collections. Algorithms for the proximity search in graph structured data 
are presented in [GS98]. The objective is to rank retrieved objects in one set, called the 
Find set, according to their proximity to objects in another given set, called the Near set. 
Specifically, applications generate the Find and Near queries on the underlying database. 
The database (or information retrieval) engine evaluates the queries and passes the Find 
and Near object result sets to the proximity engine. The proximity engine then ranks the
Find set using an available distance function, represented as the length of the shortest path
between a pair of objects from the Find and Near sets. In [GS98], the formal framework is
presented and several implementation strategies are experimentally evaluated. However,
performance remains the main problem.

The Nearest Concept Queries from [SKW01] are defined for XML data collections,
and the queries take advantage of the meet operator. For two nodes $o_1$ and $o_2$ in a
given XML tree, the meet operator, $meet(o_1, o_2)$, simply returns the lowest common
ancestor, l.c.a., of nodes $o_1$ and $o_2$. Such a node is called the nearest concept of nodes $o_1$ and
$o_2$ to indicate that the type, i.e. the node's tag, of the result is not specified by the user.
Though extensions of this operator to work on a set of nodes are also specified, reported
experiments only consider pairs of nodes. Furthermore, the response time depends
dramatically on the actual distance between the nodes.

Very recently, a semantic search engine, XSEarch, has been proposed in [CMK+03].
It is based on a simple query language that is suitable for a naive user, and it returns
semantically related document fragments that satisfy the user's query. By application of
Information Retrieval techniques, the retrieved results are ranked by estimated query
relevance.

Fig. 1. Preorder and postorder sequences of a tree

Our approach can be seen as an extension or a generalization of [SKW01] and [GS98],
assuming XML-structured object collections. In principle, the result of a relationship
(or structure) search query can be represented as a set of twigs of the searched data. In this 
way, all structural relationships, not only the l.c.a., are encapsulated and made available 
to the user for additional elaboration or ranking. To achieve high performance, we use 
the tree signature concept [ZAR03], which is a compressed XML tree representation 
supporting efficient search and navigation operations. The signatures have already been 
applied to the traditional XML searching with good success - see [ZAD03] for the 
ordered and [ZMM04] for the unordered inclusion of query trees in the data trees. 

In the following, we summarize the concept of tree signatures in Section 2 and 
define their analytic properties. In Section 3, we specify the structure search queries 
and elaborate on procedures for their efficient evaluation. We describe our prototype 
implementation in Section 4 where we also report results from experimentation. Final 
discussion and our future research plans are in Section 5. 

2 Tree Signatures 

The idea of the tree signature [ZAD03,ZAR03,ZMM04] is to maintain a space effective 
representation of the XML tree structures. Formally, the XML data tree T is seen as an 
ordered, rooted, labelled tree. To linearize the trees, the preorder and postorder ranks 
from [Die82] are applied as the coding scheme. For illustration, see the preorder and 
postorder sequences of a sample tree in Figure 1 - the node’s position in the sequence 
is its preorder/postorder rank, respectively. 

The basic (short) signature of tree $T$ with $m = |T|$ nodes has the following format

$$sig(T) = (t_1, post(t_1);\ t_2, post(t_2);\ \ldots;\ t_m, post(t_m)),$$

where $t_i$ represents the name of the node with preorder $i$ and postorder $post(t_i)$. A
richer version of the tree signature, called the extended signature, is a sequence

$$sig(T) = (t_1, post(t_1), ff(t_1), fa(t_1);\ \ldots;\ t_m, post(t_m), ff(t_m), fa(t_m)),$$

where $ff(t_i)$ is the pointer to (preorder value of) the first following node, and $fa(t_i)$
refers to the first ancestor, that is the parent node of $t_i$. If there are no following nodes,
$ff(t_i) = m + 1$. Since the root $t_1$ has no ancestors, $fa(t_1) = 0$.

descendants: $D(t_i) = \{t_j \mid i < j < ff(t_i)\}$, with $|D(t_i)| = size(t_i) = ff(t_i) - i - 1$;
following: $F(t_i) = \{t_j \mid ff(t_i) \le j \le m\}$, with $|F(t_i)| = m - ff(t_i) + 1$;
ancestors: $A(t_i) = \{t_j \mid j < i \land post(t_j) > post(t_i)\}$, with $|A(t_i)| = level(t_i) = ff(t_i) - post(t_i) - 1$;
preceding: $P(t_i) = \{t_j \mid j < i \land post(t_j) < post(t_i)\}$, with $|P(t_i)| = i + post(t_i) - ff(t_i)$.

Fig. 2. Properties of the preorder and postorder ranks.

For illustration, the extended signature of the tree from Figure 1 is



(a, 10, 11, 0; b, 5, 7, 1; c, 3, 6, 2; d, 1, 5, 3; e, 2, 6, 3; ... ; h, 8, 11, 7; o, 6, 10, 8; p, 7, 11, 8)
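A minimal sketch of how such an extended signature could be computed in one depth-first traversal is given below. The tree shape is an assumption: Figure 1 is not reproduced here, so the tree was reconstructed from the printed entries and from the pattern example in Section 3.2 (nodes g and f fill the elided positions). The function name is hypothetical.

```python
# Assumed shape of the tree from Figure 1: (name, [children]).
FIG1 = ("a", [("b", [("c", [("d", []), ("e", [])]), ("g", [])]),
              ("f", [("h", [("o", []), ("p", [])])])])

def extended_signature(tree):
    """Return the entries (name, post, ff, fa) in preorder, with 1-based ranks."""
    entries = []          # one mutable [name, post, ff, fa] per node
    post_counter = 0

    def visit(node, parent_pre):
        nonlocal post_counter
        name, children = node
        pre = len(entries) + 1
        entry = [name, None, None, parent_pre]   # fa is the parent's preorder
        entries.append(entry)
        for child in children:
            visit(child, pre)
        post_counter += 1
        entry[1] = post_counter                  # postorder rank
        entry[2] = len(entries) + 1              # first preorder after the subtree

    visit(tree, 0)                               # the root has fa = 0
    return [tuple(e) for e in entries]
```

The position of an entry in the returned list is its preorder rank, so the preorder value itself does not need to be stored.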



Given a node $t_i$ with $pre(t_i) = i$, all the other nodes can be divided into four disjoint
subsets of computable cardinalities, as illustrated in Figure 2. By definition of the nodes'
ranks, the descendant $D$ nodes (if they exist) form a contiguous sequence in both the
preorder and the postorder. Furthermore, the following $F$ nodes end the preorder sequence
and the preceding $P$ nodes start the postorder sequence. So there is actually some
empty space in the pre/post plane, as highlighted by the dark area in Figure 2 (left). In
the preorder sequence, the ancestor $A$ nodes interleave with the preceding nodes; in the
postorder sequence, the ancestor nodes interleave with the following nodes.
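The four subsets and their cardinalities can be checked directly on a signature. The sketch below assumes the complete extended signature of the Figure 1 tree, with the two elided entries (g and f) filled in from the later pattern example; `rank_sets` is a hypothetical helper.

```python
# Entries are (name, post, ff, fa); the list position + 1 is the preorder rank.
SIG = [("a", 10, 11, 0), ("b", 5, 7, 1), ("c", 3, 6, 2), ("d", 1, 5, 3),
       ("e", 2, 6, 3), ("g", 4, 7, 2), ("f", 9, 11, 1), ("h", 8, 11, 7),
       ("o", 6, 10, 8), ("p", 7, 11, 8)]

def rank_sets(sig, i):
    """Preorder ranks of the descendants, following, ancestors and preceding of node i."""
    m = len(sig)
    post_i, ff_i = sig[i - 1][1], sig[i - 1][2]
    return {
        "D": list(range(i + 1, ff_i)),
        "F": list(range(ff_i, m + 1)),
        "A": [j for j in range(1, i) if sig[j - 1][1] > post_i],
        "P": [j for j in range(1, i) if sig[j - 1][1] < post_i],
    }
```

For node g (preorder 6) this gives no descendants, the following nodes 7 to 10, the ancestors 1 and 2, and the preceding nodes 3 to 5, in agreement with the cardinality formulas.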

As demonstrated in [ZAD03], the signatures also efficiently support execution of 
tree operations such as the leaf detection, path slicing, (sub-)tree inclusion tests, and 
many others. The lowest common ancestor, l.c.a., of nodes $t_i$ and $t_j$ is the node with the
highest preorder in $A(t_i) \cap A(t_j)$. Assuming $i < j$, an efficient algorithm to find the
l.c.a. recursively follows the $fa$ pointer from $t_i$ to find the first node $t_k$ such that $ff(t_k) > j$.
Without any increase of complexity, this strategy can easily be generalized to finding 
the l.c.a. of n nodes. 
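A sketch of this pointer-following strategy, written iteratively, is shown below. It assumes the complete extended signature of the Figure 1 tree (entries for g and f reconstructed from the later example) and returns 0 when no proper common ancestor exists; the function names are hypothetical.

```python
# Entries are (name, post, ff, fa); the list position + 1 is the preorder rank.
SIG = [("a", 10, 11, 0), ("b", 5, 7, 1), ("c", 3, 6, 2), ("d", 1, 5, 3),
       ("e", 2, 6, 3), ("g", 4, 7, 2), ("f", 9, 11, 1), ("h", 8, 11, 7),
       ("o", 6, 10, 8), ("p", 7, 11, 8)]

def lca(sig, i, j):
    """Preorder rank of the lowest common (proper) ancestor of nodes i and j."""
    if i > j:
        i, j = j, i
    k = sig[i - 1][3]                       # start at fa(t_i)
    while k != 0 and sig[k - 1][2] <= j:    # climb while t_j lies outside the subtree of t_k
        k = sig[k - 1][3]
    return k

def lca_many(sig, nodes):
    """l.c.a. of n nodes: a subtree is a contiguous preorder interval, so it is
    enough to cover the smallest and the largest preorder rank."""
    return lca(sig, min(nodes), max(nodes))
```

For example, the nodes c (3) and g (6) meet at b (2), while c, g and f (7) meet only at the root a (1).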

A sub-signature, $sub\_sig_S(T)$, is a specialized (restricted) view of $T$ through signatures,
which retains the original hierarchical relationships of nodes in $T$, but does
not necessarily form a tree. Considering $sig(T)$ as a sequence of individual entries
representing nodes of $T$,

$$sub\_sig_S(T) = (t_{s_1}, post(t_{s_1});\ t_{s_2}, post(t_{s_2});\ \ldots;\ t_{s_k}, post(t_{s_k}))$$

is a sub-sequence of $sig(T)$, defined by the ordered set $S = \{s_1, s_2, \ldots, s_k\}$ of indexes
(preorder values) in $sig(T)$ with $1 \le s_i \le m$ for all $i$.

3 Structure Search Queries

The key construct of most XML query models and languages is the tree pattern, TP. 
Accordingly, research on evaluation techniques for XML queries has concentrated on 
tree pattern matching defined by the pair Query = (Q, C), where Q = (B, E) is the
query tree in which each node from B has a name (label), not necessarily different, 
each edge from E represents the parent-child or the ancestor-descendant relationship 
between pairs of nodes from B. The constraint C is a formula specifying restrictions on 
the nodes and their properties, including in general their tags, attributes, and contents. In 
order to improve query efficiency, this concept was recently extended in [CJL+03] into 
the generalized tree pattern query, GTP, where edges of the query tree Q can be optional 
or mandatory and each node carries a real number valued score. The tree signature 
concept has already proved to be highly competitive with respect to numerous other 
alternatives to accelerate execution of TP queries both considering the query trees as 
ordered [ZAD03,ZAR03] and also unordered [ZMM04]. 

In this paper, we consider a different concept of query. It does not explicitly declare 
relationships among qualifying data tree nodes. Discovering relationships is actually 
the objective of querying to find hierarchically dependent subsets forming a tree. In the 
following, we first define the query model, then define its evaluation principles, and 
finally specify an efficient algorithm for the query execution. 

3.1 Structure Search Query Model 

We define the structure search query $SS\_Q$ as a conjunctive normal form of node
specification expressions $E_i^j$ as

$$SS\_Q = (E_1^1 \lor \ldots \lor E_{n_1}^1) \land \ldots \land (E_1^k \lor \ldots \lor E_{n_k}^k),$$

where each $E_i^j$ is an XPath expression determining candidate nodes to be used as a
starting point for relationships discovery, and $n_j$ is the number of expressions in the
$j$-th conjunct. The result of such a query is the set of twigs
of the searched data tree, where structural relationships existing among nodes qualifying
$E_i^j$ of different conjuncts are made explicit. Individual nodes can be constrained by full
or partial name-path specifications as well as by content predicates. Observe that an
$E_i^j$ expression might also search for all elements having a specific content, no matter
what the name of the element is, or it might search for all elements of a certain name,
independently of their content and structural relationships in the data tree.

Provided the XML schema is known, traditional XML processing tools, e.g. XQuery, 
can also determine instances of all such structural relationships. However, this would 
require execution of multiple TP queries, each considering a specific node
relationship constraint. Our approach has a higher expressive power, because it does
not require a priori knowledge of the structure of the XML documents and, as we will 
see later, offers higher processing performance. 

More formally, the answer to a structure search query $SS\_Q$ is a set of sub-trees (or
twigs) of the data tree, obtained as follows. Let $T$ be the data tree and $SS\_Q$ a structure
search query consisting of $k$ conjuncts. Let $R^Q$ be a pattern of $k$ nodes of $T$ such that:

- each node qualifies in a different conjunct, and
- all the nodes share a common ancestor.

Then, a qualifying twig $T^Q$ is the sub-tree of $T$ induced by $R^Q$. $T^Q$ consists of all
nodes of $R^Q$, but can also include additional, induced, nodes, which are the ancestors
of nodes from $R^Q$ up to their l.c.a., that is the root of the twig. Since $T^Q$ is a sub-tree
of $T$, the relationships of nodes in $T^Q$ are the same as in $T$. In the following, we denote
the pattern of induced nodes as $I^Q$. For formal manipulations, we see the patterns $R^Q$
and $I^Q$ as well as the twig $T^Q$ as sub-signatures of $sig(T)$, designated, respectively, as
$sub\_sig_{S^R}(T)$, $sub\_sig_{S^I}(T)$, and $sub\_sig_{S^T}(T)$. For brevity, we use $sub\_sig_S(T)$ whenever explicit
distinctions between the patterns and twigs are not necessary.

For illustration, consider the following query. 

(//person[name=John] ∨ //person[name=Jack]) ∧ (//person[name=Ted]).

The qualifying patterns are pairs, because the query consists of two conjuncts. Fur- 
thermore, one node of the pair is always a person with name Ted and the other is a 
person with name John or Jack. However, the resulting twigs can be quite different even 
for a specific pair of nodes. For example, Jack and Ted can appear below the element
<author>, which implies that they are coauthors of the same article. But they can have 
the <journal> element as their l.c.a. where several alternatives for twigs can occur. 
Two persons can be authors of different journal papers, but they can also be editors of 
this journal, or one of them can be the editor and the other the author. 
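To make the query model concrete, the sketch below evaluates a query of this shape with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires quoted string values in predicates. The sample document and the function name are hypothetical; the output is one sorted list of preorder ranks per conjunct.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<bib>
  <article>
    <person><name>Jack</name></person>
    <person><name>Ted</name></person>
  </article>
  <article>
    <person><name>John</name></person>
  </article>
</bib>"""

def candidate_lists(root, cnf):
    """cnf is a list of conjuncts; each conjunct is a list of path expressions.

    Returns, per conjunct, the sorted preorder ranks of all qualifying elements.
    """
    # root.iter() walks the tree in document order, which is the preorder.
    pre = {el: rank for rank, el in enumerate(root.iter(), start=1)}
    lists = []
    for conjunct in cnf:
        hits = set()
        for expr in conjunct:
            hits.update(pre[el] for el in root.findall(expr))
        lists.append(sorted(hits))
    return lists

QUERY = [[".//person[name='John']", ".//person[name='Jack']"],
         [".//person[name='Ted']"]]
LISTS = candidate_lists(ET.fromstring(DOC), QUERY)
```

Here the first conjunct yields the persons Jack and John and the second yields Ted; which of the resulting pairs share a meaningful twig is decided by the later phases of query evaluation.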



Level constrained structure search. In some cases, it might be desirable to find twigs
whose root is at a level of at least a certain value; we assume the document's root to be
at level 0. For example, consider the DBLP XML data set [DBLP]. It consists of just one large
XML document (file), having the <dblp> element as its root. In this case, structure search
queries would typically produce a large set of results, because all possible query patterns
have at least the <dblp> element as the root. Furthermore, the fact that two names with
the l.c.a. <dblp> appear in the DBLP bibliography is probably of low significance. It
would be more useful to search for structural relationships excluding the <dblp> root
element, that is searching for twigs with the root element on levels greater than 0. We
refer to such queries as level constrained structure search queries; their parameter
$lev_{min}$ indicates that the level of the roots of the qualifying twigs should be greater
than or equal to $lev_{min}$.

3.2 Query Evaluation Principles 

Assume a collection of XML documents and a level constrained structure search query.
Suppose each conjunct produces a non-redundant list $L_i$, $i = 1, 2, \ldots, k$, of preorder
values of qualifying nodes in $T$. The central problem of the query evaluation is to determine
the query patterns $R^Q$ as the sub-signatures $sub\_sig_{S^R}(T) \mid S^R = \{s_1, s_2, \ldots, s_k\}$,
which satisfy the following properties:

1. each $s_j$ is from exactly one list $L_i$ and no two instances are from the same list;

2. the constraint $s_1 < s_2 < \ldots < s_k$ is satisfied;

3. the l.c.a. of $\{s_1, s_2, \ldots, s_k\}$ exists on level $\ge lev_{min}$.

For example, if the query is (//g) ∧ (//f) ∧ (//c), then the qualifying pattern in tree
$T$ from Figure 1 is $sub\_sig_{S^R}(T) = (c, 3;\ g, 4;\ f, 9)$, because $S^R = \{3, 6, 7\}$. The l.c.a.
of nodes c, g, and f is a, that is a node at level 0.

3.3 Structure Search Query Evaluation Strategies 

The query evaluation proceeds in the following four phases: 1) evaluation of the
expressions $E_i^j$; 2) evaluation of the conjuncts; 3) generation of the node patterns $R^Q$;
4) generation of the resulting twigs $T^Q$. This sub-section discusses efficient execution of
phases 1), 2), and 3). Section 3.5 concerns the evaluation of phase 4).

The evaluation of each expression $E_i^j$ returns a list $L_{E_i^j}$ of nodes qualifying for the
corresponding XPath expression. We suppose that these lists are ordered according to
the preorder ranks. The efficient evaluation of expressions $E_i^j$ and ordering of the lists
$L_{E_i^j}$ can be obtained by using XML path indexes and/or the XML tree signatures, as
discussed in [ADR+03,ZAD03].

The result of the $j$-th conjunct is the union of the lists $L_{E_i^j}$ for all $i$ and a specific
$j$. Since the lists $L_{E_i^j}$ are ordered with respect to the preorder ranks, a multi-way merge of
the corresponding lists ensures efficiency of this procedure. This will produce $k$ lists $L_i$ of
nodes, one for each conjunct, still ordered by the preorder ranks.
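The merge of phase 2) can be sketched with the standard-library `heapq.merge`, which lazily merges already sorted inputs. The function name is hypothetical; duplicates are dropped so that the resulting list is non-redundant.

```python
import heapq

def evaluate_conjunct(expression_lists):
    """Union of several preorder-sorted lists, returned sorted and without duplicates."""
    merged = []
    for pre in heapq.merge(*expression_lists):
        # Equal ranks arrive next to each other, so comparing with the last
        # emitted value is enough to remove duplicates.
        if not merged or merged[-1] != pre:
            merged.append(pre)
    return merged
```

A node that qualifies for two expressions of the same conjunct therefore appears only once in the conjunct's list.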

A naive way to obtain the patterns $R^Q$ is to first produce the Cartesian product of the $k$
lists $L_i$, and then eliminate those $k$-tuples that do not satisfy the conditions defined in
Section 3.2. However, the cost of such a process can be very high, because the Cartesian
product can produce many tuples among which only a few finally qualify. Such an
approach also assumes the complete $k$ lists to be available. In the following, we propose a
new algorithm, called the structure join, which performs this step of query execution
efficiently. The algorithm produces all patterns $R^Q$.
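For reference, the naive strategy can be sketched as below. It assumes the complete extended signature of the Figure 1 tree (entries for g and f reconstructed from the pattern example) and, as a simplifying assumption, treats a node as covering itself, so the level test is applied to the lowest node whose subtree contains the whole tuple. All names are hypothetical.

```python
from itertools import product

# Entries are (name, post, ff, fa); the list position + 1 is the preorder rank.
SIG = [("a", 10, 11, 0), ("b", 5, 7, 1), ("c", 3, 6, 2), ("d", 1, 5, 3),
       ("e", 2, 6, 3), ("g", 4, 7, 2), ("f", 9, 11, 1), ("h", 8, 11, 7),
       ("o", 6, 10, 8), ("p", 7, 11, 8)]

def level(i):
    """Level of node i, computed as ff - post - 1 (the root is at level 0)."""
    return SIG[i - 1][2] - SIG[i - 1][1] - 1

def covering_node(nodes):
    """Lowest node whose subtree (including itself) contains all the given nodes."""
    k, hi = min(nodes), max(nodes)
    while k != 0 and SIG[k - 1][2] <= hi:
        k = SIG[k - 1][3]
    return k

def naive_patterns(lists, lev_min):
    """Filter the full Cartesian product of the conjunct lists."""
    patterns = set()
    for combo in product(*lists):
        if len(set(combo)) < len(combo):
            continue                      # s_1 < ... < s_k needs distinct nodes
        root = covering_node(combo)
        if root != 0 and level(root) >= lev_min:
            patterns.add(tuple(sorted(combo)))
    return sorted(patterns)
```

Every tuple of the product is inspected, which is exactly the cost the structure join is designed to avoid.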



Table 1. Structure Join Algorithm 



1: procedure StructureJoin(L_1, ..., L_k, lev_min)
   ▷ L_1, ..., L_k are the lists with elements sorted with respect to preorder numbering;
   ▷ lev_min is the minimum level accepted for roots of qualifying twigs;
   ▷ L_i(j) is the j-th element in list L_i;
2:   result := ∅;
3:   if (∃i : L_i = ∅) then return result;
4:   else
5:     c_i := 1 [∀i : i = 1, ..., k];   ▷ Cursors pointing to current tops of the lists
6:     P := max{ pre(L_i(c_i)) | i = 1, ..., k };
7:     M ∈ { j | pre(L_j(c_j)) = P ∧ 1 ≤ j ≤ k };
8:     a_M ∈ { a | a ∈ ancestors(L_M(c_M)) ∧ level(a) = lev_min };
9:     if (a_M = null) then c_M := c_M + 1; goto step 6; end if
10:    if (∃i : pre(L_i(c_i)) < pre(a_M)) then
11:      c_i := min{ c | pre(L_i(c)) ≥ pre(a_M) } [∀i : i = 1, ..., k];
12:      if (∃i : c_i > length(L_i)) then return result; else goto step 6;
13:    else
14:      pl_M := ff(a_M) − 1;   ▷ pl_M is the preorder of the last node of the sub-tree rooted at a_M
15:      SL_i := { L_i(c) | c_i ≤ c ∧ pre(L_i(c)) ≤ pl_M } [∀i : i = 1, ..., k];
16:      <Generate sub-signatures of length k with each element from a different SL_i; put them in result>
17:      c_i := min{ c | pre(L_i(c)) > pl_M } [∀i : i = 1, ..., k];
18:      if (∃i : c_i > length(L_i)) then return result; else goto step 6;
19:    end if
20:  end if
21: end procedure

The StructureJoin algorithm specified in Table 1 takes as input $k$ lists of qualifying
nodes, with elements sorted with respect to preorder numbering, and the level constraint
$lev_{min}$. The algorithm avoids the unnecessary generation of invalid sub-signatures
by restricting the computation of the Cartesian product to portions of the lists $L_i$ that
contain elements which belong to the same potentially qualifying sub-tree. Step 3 checks
for an empty list, which terminates the algorithm. Then cursors $c_i$ are set to refer to the
first element of the lists in Step 5. Steps 6 and 7 choose $M$ as the index of the list that
has the element $L_M(c_M)$ with the maximum preorder. Provided a pattern of nodes with
structural relationships with $L_M(c_M)$ exists, the corresponding sub-signatures will have
$L_M(c_M)$ as the last node. Step 8 determines the ancestor $a_M$ of $L_M(c_M)$ which has
level $lev_{min}$. The ancestor can be efficiently obtained by using the tree signature. This
ancestor is the root of the sub-tree containing all possible nodes that can be joined with
$L_M(c_M)$ (i.e. having valid structural relationships with $L_M(c_M)$). If a valid ancestor
cannot be found, i.e. $level(L_M(c_M)) < lev_{min}$, the element is discarded by moving the
cursor forward (Step 9). Step 10 checks that the top element of all lists belongs to the
sub-tree rooted at $a_M$, that is, checks that the preorder of the top elements is greater than
or equal to $pre(a_M)$. If the top element of at least one list is smaller than $pre(a_M)$, then
the cursor of that list is moved to the first element with preorder greater than or equal to
$pre(a_M)$ (Step 11), and if the end of the list is reached, the algorithm ends (Step 12). If
the top element of every list is greater than or equal to $pre(a_M)$, then Step 14 uses the
first following pointer, contained in the signatures, to compute $pl_M$, the preorder of the last
node in the sub-tree rooted at $a_M$. Note that at this point no list can have a top element with
preorder greater than $pl_M$, because that list would have been selected at Steps 6 and 7. All nodes
in the lists that have preorder between $pre(a_M)$ and $pl_M$ are used to generate
qualifying sub-signatures, that is, the Cartesian product is computed only using
the consecutive portions of the lists corresponding to elements having preorder between
$pre(a_M)$ and $pl_M$. In fact, Step 15 determines the set of elements in each list that should
be used for the Cartesian product; the sub-lists $SL_i$ always form a contiguous sequence in
the corresponding lists $L_i$. Step 16 computes the Cartesian product by generating all possible
combinations of nodes belonging to the sub-tree rooted at $a_M$ present in the lists, and
arranges the obtained tuples to form valid sub-signatures. Step 17 moves the cursors to
point to the new top element corresponding to the next sub-tree in each list. If the end
of at least one list is reached, the algorithm ends (Step 18).
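The steps above can be sketched in Python as follows. This is an illustrative reading of Table 1, not the authors' implementation. It uses 0-based cursors and a loop instead of gotos, and it assumes that a node already at level `lev_min` counts as its own ancestor in Step 8. It also stops when a cursor runs off its list after Step 9, and drops tuples that repeat a node. The signature is the assumed complete one for the Figure 1 tree.

```python
from itertools import product

# Entries are (name, post, ff, fa); the list position + 1 is the preorder rank.
SIG = [("a", 10, 11, 0), ("b", 5, 7, 1), ("c", 3, 6, 2), ("d", 1, 5, 3),
       ("e", 2, 6, 3), ("g", 4, 7, 2), ("f", 9, 11, 1), ("h", 8, 11, 7),
       ("o", 6, 10, 8), ("p", 7, 11, 8)]

def level(sig, i):
    return sig[i - 1][2] - sig[i - 1][1] - 1

def ancestor_at_level(sig, i, lev):
    """Ancestor-or-self of node i at the given level, or None if i lies above it."""
    if level(sig, i) < lev:
        return None
    while level(sig, i) > lev:
        i = sig[i - 1][3]
    return i

def structure_join(sig, lists, lev_min):
    result = set()
    if any(not lst for lst in lists):                         # step 3
        return []
    k = len(lists)
    cur = [0] * k                                             # step 5
    while all(cur[i] < len(lists[i]) for i in range(k)):      # steps 12 and 18
        M = max(range(k), key=lambda i: lists[i][cur[i]])     # steps 6-7
        a = ancestor_at_level(sig, lists[M][cur[M]], lev_min) # step 8
        if a is None:                                         # step 9
            cur[M] += 1
            continue
        if any(lists[i][cur[i]] < a for i in range(k)):       # step 10
            for i in range(k):                                # step 11
                while cur[i] < len(lists[i]) and lists[i][cur[i]] < a:
                    cur[i] += 1
            continue
        pl = sig[a - 1][2] - 1                                # step 14
        sub_lists = []
        for i in range(k):                                    # step 15
            end = cur[i]
            while end < len(lists[i]) and lists[i][end] <= pl:
                end += 1
            sub_lists.append(lists[i][cur[i]:end])
            cur[i] = end                                      # step 17
        for combo in product(*sub_lists):                     # step 16
            if len(set(combo)) == k:
                result.add(tuple(sorted(combo)))
    return sorted(result)
```

With `lev_min = 1` and the lists `[3, 9]` and `[4, 10]`, only the two small products inside the sub-trees rooted at b and f are formed, instead of the full product of both lists.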

Example: We illustrate the behavior of the algorithm with an example. Figure 3
shows a data tree template (on the left) and the manipulation process of the joined lists
(on the right). Suppose a structure search query with three conjuncts, where the phases
1) and 2) have produced the ordered lists $L_1$, $L_2$, and $L_3$. Sub-trees involved in the
algorithm execution are labelled from 1 to 4. The nodes of these sub-trees contained
in the lists are highlighted with rectangles also labelled as the corresponding sub-trees.
Finally, suppose we are at the beginning of a generic iteration of the algorithm with cursors
pointing to positions $c_1$, $c_2$, and $c_3$. Steps 6 and 7 choose $M_1$ such that the preorder of
$L_{M_1}(c_{M_1})$ is the maximum (i.e. the rightmost) among the elements pointed to by the list
cursors. Step 8 determines $a_{M_1}$ as the highest possible ancestor of such a node. Elements
with admissible structural relationships with $L_{M_1}(c_{M_1})$ should be in the sub-tree rooted
at $a_{M_1}$. Step 10 detects that there are lists where the first element is not in the currently
considered sub-tree. In fact, the first element of list $L_1$ is from sub-tree 1. Note that in
order to have valid structure relationships involving elements of sub-tree 1, all other lists
must have elements of that sub-tree. However, the fact that cursors of the other lists refer
to elements of sub-list 3 means that no elements from sub-tree 1 were found in previous
iterations. Step 11 moves the cursors of the lists to point to the first element with preorder
greater than or equal to $pre(a_{M_1})$. Note that also elements of sub-tree 2 in list $L_1$ are skipped at
this step. Now, Steps 6 and 7 choose the new rightmost element $L_{M_2}(c_{M_2})$, which now
belongs to sub-tree 4. Step 8 determines $a_{M_2}$, which now is the root of sub-tree 4. Step
10 determines that there are lists whose first element does not belong to sub-tree 4 (both
lists $L_2$ and $L_3$ have the first element from sub-tree 3). As before, Step 11 moves the
cursors forward to refer to elements after $pre(a_{M_2})$. The current situation is that all lists now have
elements from sub-tree 4. Therefore the algorithm arrives at Step 14, where the preorder
$pl_{M_2}$ of the last element of sub-tree 4 is determined, and the Cartesian product of sub-lists
with elements of sub-tree 4 is performed at Step 16. Step 17 moves the cursors to the next
sub-tree in the lists and the algorithm proceeds with a new iteration.



3.4 Algorithmic Complexity Considerations 

It is important to point out that this algorithm computes the Cartesian products of small 
(continues) portions of the lists, provided the width of the sub-trees is small. In fact, just 
elements belonging to the same sub-tree are joined. 

From this observation we can show that the complexity of our algorithm, in the 
average case, can be considered linear with respect to the size of the input. In 
fact, the complexity of a single Cartesian product is equal to the product of the sizes of 
the sub-lists $SL_i$. This can be realistically bounded by the average size of the sub-trees. 
Let us call that size $s_{avg}$. Then, with $k$ joined lists, the complexity of the Cartesian 
product is $O((s_{avg})^k)$. 
The number of Cartesian products computed is bounded by the number of sub-trees, 
because the algorithm computes at most one Cartesian product for every sub-tree. The 
number of sub-trees can be estimated as $n / s_{avg}$, where $n$ is the total number of elements 
in the dataset. Therefore, the complexity of the algorithm is 

$$O((s_{avg})^k \times (n / s_{avg})) = O((s_{avg})^{k-1} \times n).$$

If the average size of the sub-trees is much smaller than the size of the dataset 
($s_{avg} \ll n$), which is true for large databases, the complexity is linear in the size $n$ 
of the input. Of course, in the worst case, i.e. when levmin is set to 0 and the data 
set is composed of one large XML file, the result is that $s_{avg} = n$ and the complexity is 
$O(n^k)$. However, this kind of search has very little semantic meaning, as we have 
already explained. 
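
The effect of joining only within sub-trees can be illustrated with a minimal Python sketch. It is not the cursor-based algorithm described above: it assumes, for illustration only, that every list entry already carries the identifier of the sub-tree it belongs to, and the function name `join_per_subtree` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def join_per_subtree(lists):
    """Join k lists of (subtree_id, element) pairs, one sub-tree at a time.

    Only elements that share a sub-tree are combined, so the work per
    sub-tree is the product of the (small) sub-list sizes.
    """
    grouped = []
    for entries in lists:
        by_subtree = defaultdict(list)
        for subtree_id, element in entries:
            by_subtree[subtree_id].append(element)
        grouped.append(by_subtree)

    # A sub-tree qualifies only if every list has at least one element in it.
    common = set(grouped[0]).intersection(*grouped[1:])

    results = []
    for subtree_id in sorted(common):
        sublists = [g[subtree_id] for g in grouped]
        results.extend(product(*sublists))  # one small Cartesian product
    return results

# Hypothetical input: titles and authors tagged with their sub-tree number.
titles = [(1, "t1"), (2, "t2"), (3, "t3")]
authors = [(1, "a1"), (1, "a2"), (3, "a3")]
print(join_per_subtree([titles, authors]))
```

With this input, sub-tree 2 is skipped because it has no author, and only three pairs are enumerated instead of the nine candidate pairs of the full Cartesian product.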



3.5 Data Twig Derivation 

In order to derive a qualifying twig $T^S$ for query $SS_Q$ in tree $T$, we start with the 
sub-signature $sub\_sig_{S^R}(T)$ representing the query's qualifying pattern. Then, for each 
ancestor set $i = 1, 2, \ldots, k$, we determine a sub-signature $sub\_sig_{S_i}(T)$ of the 
path from the node to the l.c.a. of all nodes in $S^R$. In this way, we obtain the sets $S_i$, and 
the qualifying twig is obtained as a tree induced from the pattern as 

$$sig(T^S) = sub\_sig_{S^T}(T) \mid S^T = S^R \cup \bigcup_{i} S_i,$$

considering the sets $S_i$ as ordered, so that the set $S^T$ is also ordered. 

Fig. 3. Structure Join execution example 

Provided $S^T = \{s_1, s_2, \ldots, s_h\}$, the sub-signature $sub\_sig_{S^T}(T)$ defines a tree twig 
with root $t_{s_1}$ of preorder value $s_1$ in $T$. Since the preorder/postorder values in 
sub-signatures are those of the original tree, leaves in sub-signatures are not necessarily 
leaves of $T$, so they can only be recognized by checking consecutive entries of the 
sub-signature. Specifically, if $post(t_{s_i}) < post(t_{s_{i+1}})$ then the $i$-th entry in the sub-signature 
is a leaf of the twig $T^S$. Naturally, the last element, $t_{s_h}$ in our case, is always a leaf. The 
element $t_{s_1}$ need not be checked, because it is the root of $T^S$. 

If we continue with our previous example of the structure search query $(//g) \wedge (//f) \wedge (//c)$, 
resulting in the query pattern sub-signature $sub\_sig_{S^R}(T) = \langle c,3;\ g,4;\ f,9 \rangle \mid S^R = \{3, 6, 7\}$, 
the sub-signatures of the ancestors define the sets $S_1 = \{1, 2\}$ (for the 
node c), $S_2 = \{1, 2\}$ (for the node g), and $S_3 = \{1\}$ (for the node f). The union of the 
ordered sets $S^R$, $S_1$, $S_2$, and $S_3$ is $S^T = \{1, 2, 3, 6, 7\}$, which is a sub-tree of nodes 
a, b, c, g, f from $T$, rooted at node a. If we change our query to $(//g) \wedge (//c)$, we get 
$sub\_sig_{S^R}(T) = \langle c,3;\ g,4 \rangle \mid S^R = \{3, 6\}$, so $S_1 = \{2\}$ and $S_2 = \{2\}$, because the 
lowest common ancestor of c and g is b. When we make the ordered union of all these 
sets, we get $S^T = \{2, 3, 6\}$, which defines a sub-tree rooted at node b. 
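
The two examples can be replayed with a small Python sketch. It does not operate on tree signatures as the paper does: for illustration it assumes an explicit parent map over preorder ranks, reconstructed from the ancestor sets quoted above. The postorder values of a and b (10 and 5) are assumptions; only those of c, g and f come from the example. The helper names are hypothetical.

```python
def ancestors(node, parent):
    """Ancestors of node, nearest first, following the parent map."""
    path = []
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path

def derive_twig(qualifying, parent):
    """Ordered preorder ranks of the twig: qualifying nodes plus paths to the l.c.a."""
    chains = [[q] + ancestors(q, parent) for q in qualifying]
    common = set(chains[0]).intersection(*chains[1:])
    # Chains are ordered nearest-first, so the first common node is the l.c.a.
    lca = next(n for n in chains[0] if n in common)
    twig = set(qualifying)
    for chain in chains:
        twig.update(chain[1:chain.index(lca) + 1])  # ancestors up to the l.c.a.
    return sorted(twig)

def twig_leaves(twig, post):
    """Entry i is a leaf if post(entry i) < post(entry i+1); the last entry always is."""
    leaves = [twig[i] for i in range(len(twig) - 1)
              if post[twig[i]] < post[twig[i + 1]]]
    return leaves + twig[-1:]

# Preorder ranks: a=1, b=2, c=3, g=6, f=7 (other nodes of T are omitted).
parent = {1: None, 2: 1, 3: 2, 6: 2, 7: 1}
post = {1: 10, 2: 5, 3: 3, 6: 4, 7: 9}  # values for ranks 1 and 2 are assumed

print(derive_twig([3, 6, 7], parent))
print(derive_twig([3, 6], parent))
print(twig_leaves(derive_twig([3, 6, 7], parent), post))
```

The first call yields the node set rooted at a, the second the one rooted at b, and the leaf test identifies c, g and f as the leaves of the larger twig.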



4 Experimental Evaluation 

In this section we validate the Structure Join algorithm from the performance point 
of view. Our algorithm, as demonstrated in Table 1, offers very high performance and 
scales very well with the increasing size of the joined lists. The structure join algorithm 
was implemented in Java (JDK 1.4.0), and the experiments were run on a PC with an 1800 MHz 
Intel Pentium 4, 512 MB of main memory, and an EIDE disk, running Windows 2000 Professional 
edition with the NT file system (NTFS). 

We have conducted most of our experiments on the XML DBLP dataset [DBLP], consisting 
of about 200 MB of data. The size of the signature file is about 30% of the 
data size. We have verified the performance of the structure join algorithm in three tests. 
First, we have measured its performance using different queries, which have different 
numbers of conjuncts and different sizes of the input sets (see Section 4.1). Second, we 
have compared the efficiency of our structural join with the meet operator proposed 
in [SKW01] (see Section 4.2). Finally, we have run experiments in a real application 
scenario of insurance records (see Section 4.3). 

4.1 Performance Measurements 

The queries that we have used to run the first group of experiments are listed in the 
first column of Table 2. Each query is coded as "QD_n", where D indicates the size of 
the input set (which can be Small S, Medium M, or Large L) and n can be 2 or 3 to 
indicate the number of conjuncts. For all our queries, the level constraint levmin was set 
to 1 (just below the root element, which is on level 0). For each query, Table 2 reports the 
size of the corresponding input sets, the size of the output set, the number of Cartesian 
products computed (that is, the number of qualifying sub-trees containing elements in 
the input sets), the average number of iterations executed in each Cartesian product, and 
the elapsed time to complete the structure join in milliseconds. 



Table 2. Performance of the Structure Join algorithm. In the last but one column, #CP indicates the 
number of Cartesian products and #ICP the average number of iterations for each Cartesian 
product. 

Query  Expression                                       #input set           #output set  #CP (#ICP)  Time (ms)
-----  -----------------------------------------------  -------------------  -----------  ----------  ---------
QS_2   //phdthesis/title ∧ //phdthesis/author           72, 72               72           72 (1)      <1
QM_2   //incollection/title ∧ //incollection/author     1410, 2931           2931         1400 (2)    30
QL_2   //inproceedings/title ∧ //inproceedings/author   22004, 53243         53243        21977 (2)   392
QS_3   QS_2 ∧ //phdthesis/year                          72, 72, 72           72           72 (1)      <1
QM_3   QM_2 ∧ //incollection/year                       1410, 2931, 1410     2931         1400 (2)    37
QL_3   QL_2 ∧ //inproceedings/year                      22004, 53243, 22004  53243        21977 (2)   512



The reported processing time is obtained as the average over one hundred independent 
executions of the algorithm. It only includes the processing time required for the structure 
join algorithm and does not include the time needed to obtain the input sets. The 
experiments demonstrate a linear trend with respect to the cardinality of the largest input 
set, confirming our expectation of linear complexity. 

As explained previously, a trivial technique to execute the structure search is to 
compute a Cartesian product of the input sets and to eliminate the non-qualifying tuples. 
The complexity of such a strategy would have been polynomial. In particular, the number 
of iterations required would have been equal to the product of the cardinalities of the 
input sets. Our algorithm, on the other hand, computes several Cartesian products, and 
the number of iterations in each product is small. Therefore, the overall cost of our 
algorithm is linear. 

Note that the number of computed Cartesian products increases linearly with the sizes 
of the input sets (given that the number of qualifying sub-trees increases linearly with 
the sizes of the input sets). On the other hand, the number of iterations computed in 
each Cartesian product is independent of the sizes of the input sets: it is very small in 
practical cases and can be considered constant. This is confirmed by the 
actual number of Cartesian products computed, the average number of iterations in each 
product, and the elapsed time reported in Table 2. 

For example, the trivial Cartesian product technique would have required 72*72*72 
iterations to process query QS_3, while our technique needs only 72*1 iterations. The 
trivial Cartesian product technique would have required 22004*53243*22004 iterations 
to process query QL_3, while our technique requires just 21977*2 iterations. Note that 
the number of computed Cartesian products (21977) is smaller than the size of the smaller 
input set (22004). This is due to the fact that sometimes elements were discarded, because 
no elements from the same sub-trees were found in the other input sets (inproceedings 
with title and without authors or years). 
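
The iteration counts quoted above can be recomputed directly from the figures in Table 2. The short Python sketch below only restates that arithmetic; the function names are hypothetical.

```python
from math import prod

def naive_iterations(input_sizes):
    """Iterations of a single Cartesian product over all input sets."""
    return prod(input_sizes)

def per_subtree_iterations(num_products, avg_iterations):
    """Iterations when one small product is computed per qualifying sub-tree."""
    return num_products * avg_iterations

# (input set sizes, #CP, #ICP) as reported in Table 2.
queries = {
    "QS_3": ((72, 72, 72), 72, 1),
    "QL_3": ((22004, 53243, 22004), 21977, 2),
}
for name, (sizes, num_cp, icp) in queries.items():
    print(name, naive_iterations(sizes), per_subtree_iterations(num_cp, icp))
```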

4.2 Comparison with Other Techniques 

To compare our algorithm with the meet operator, we have repeated the experiments 
from [SKW01]. Specifically, we have searched DBLP for the string "ICDE" and for the 
year records, incrementally including years from 1999 to 1984. We have performed the 
structure join on these two sets and we have computed the elapsed time of our algorithm. 
Figure 4 compares the elapsed time of the original meet operator and our algorithm, 
varying the size of the input set (which corresponds to the number of years 
included in one input set). The graph on the left side shows the performance of our 
algorithm, while the graph on the right shows the performance of the meet operator. 
Our technique is about two orders of magnitude faster than the meet operator: the time 
scale of the graph on the left is 100 times smaller than that of the graph on the right. For 
instance, with the maximum input set sizes, our technique is 96 times more efficient than 
the meet operator technique. Specifically, our algorithm processes the query in about 30 
ms, while the meet operator needs about 3 seconds. 

4.3 On-the-Field Experiment 

In addition to the previous performance tests, we have also conducted experiments with 
a dataset related to a more realistic and complex scenario. Figure 5 sketches the XML 
structure of information used by a car insurance company to keep track of the customers 
and their record of accidents, that is, information about the type of accident, its status, 
those involved in the accident as witnesses, those injured, etc. In order to detect possible 
frauds, the insurance company may be interested in discovering all relationships between 
two (or more) of their customers. Examples of requests are as follows: have they been 
involved in many accidents, possibly with different roles (i.e. causing the accident, 
witness, injured person, etc.)? Who were those involved as witnesses, as injured persons, 
or as the insurance owners? 

Fig. 4. Comparison between our structjoin algorithm and the meet operator 

The size of this dataset is 45 MB and it contains 10000 insurance policies. In the 
experiment, we have searched for co-occurrences of four specific persons, identified by 
their names, using as the level thresholds (levmin) 0, 1 and 2, corresponding, respectively, 
to elements <Insurance Policy>, <Risk>, and <Accident>. In the first case, we 
implicitly searched for co-occurrences of the four persons either as an owner, witness, 
or subject involved in accidents of the same insurance policy. In the second case, we 
implicitly searched for co-occurrences as a witness or subject involved in accidents of the 
same policy. In the last case, we searched for co-occurrences in the same accident. The 
number of twigs that we have found, i.e. those which satisfy the search criteria, is 6 for 
level 0, 2 for level 1, and 1 for level 2. An interesting observation was that the 
specified persons were somehow involved together in 6 different insurance policies and 
the related accidents with different roles. This could suggest that further investigation 
should be performed by the insurance company on these subjects. The time required for 
processing such queries was, respectively, 7, 5, and 4 milliseconds, confirming the high 
performance of the technique in real application scenarios. 

5 Conclusions 

Contemporary XML search engines reason about the meaning of documents by considering 
the structure of documents and content-based predicates, such as sets of keywords 
or similarity conditions. However, the structure of the documents is not always known, 
so it can become the subject of searching. 

In this paper, we have introduced a new type of query, called the structure search, 
that allows users to query XML databases with value predicates and node names, but 
without specifying their actual relationships. The retrieved entities are sub-trees of 
the searched trees (twigs), which make evident the structural relationships among the determined 
tree nodes, if they exist. The twigs can also contain additional tree nodes, not explicitly 
required by the query, so transitive relationships are also discovered. Our prototype 
implementation demonstrates that the proposed algorithms yield useful results on real-world 
data and scale well, enabling interactive querying. In this way, our approach can 
be seen as a considerable extension of previous approaches to proximity searching in 
structured data. With the help of tree signatures, it also has a very efficient implementation, 
as needed for processing the large XML data collections available on the web. The results 
are confirmed by systematic experiments on the DBLP dataset. A possible application is 
outlined by the structure search on insurance records, including a performance evaluation. 

Fig. 5. Schema of the insurance records dataset 

Our future plans concern introducing even more flexibility or vagueness into the 
specification of expressions determining nodes, for example by considering alternative names, 
such as Author or Writer, which can be automatically chosen from proper dictionaries or 
lexicons. We are also working on developing ranking mechanisms that would order 
or group the retrieved twigs according to their relevance with respect to the query. 
Here, the analytic properties of tree signatures will play an indispensable role. 

Abstract. A corrector takes an invalid XML file F as input and 
produces a valid file F' which is not far from F when F is ε-close to its 
DTD, using the classical Tree Edit distance between a tree T and a 
language L defined by a DTD or a tree-automaton. We show how testers 
and correctors for regular trees can be used to estimate distances 
between a document and a set of DTDs, a useful operation to rank XML 
documents. 

We describe the implementation of a linear time corrector using the 
Xerces parser and present test data for various DTDs comparing the 
parsing and correction time. We propose a generalization to homomor- 
phic DTDs. 



1 Introduction 

For classification, search and querying of semistructured data, it is important to 
detect approximate validity, i.e. whether a file F approximately follows a structure 
M. In the case of XML data, the structure is a DTD or a schema and we 
want to decide if a file F approximately follows a DTD M. A search engine for 
structured data would specify a finite set $M_1, \ldots, M_k$ of structures and would like 
to efficiently rank all documents by estimating their distances to these DTDs. 
The documents were valid for their own DTD when they were created, but are 
most likely invalid for $M_1, \ldots, M_k$ because of linguistic differences in tags, small 
changes in the internal structure of DTDs and potential errors in the file. It is 
therefore important to efficiently estimate how close they are to a fixed set of 
DTDs. 

A given file is well-formed if the tags follow a tree-like structure and is valid 
if the tree is accepted by the automaton associated with the DTD. The tool 
Tidy [13] takes an XML file as input and corrects it if it is not well-formed: it 
adds the missing tags and removes the incorrect ones to obtain a well-formed file 
close to the original one. We first describe the implementation of a similar tool 
for validity. It takes a well-formed XML file and a DTD as input and corrects 
it if it is invalid: it adds the missing leaves and subtrees, removes or modifies 
the incorrect ones to obtain a valid file close to the original one. Moreover it 
estimates in linear time the distance between a file and several DTDs which can 
be used to rank documents relative to these DTDs. 

* Work supported by ACI Securite Informatique: VERA of the French Ministry of 
Research. 

The distance between two XML files is the classical Tree Edit Distance [14,12, 
2] applied to the DOM trees associated with the files. It measures the number of 
insertions and deletions of nodes and edges, and the number of label changes to a 
given node. It generalizes the classical Edit Distance on strings, which measures 
the number of insertions, deletions and modifications of characters necessary 
to apply to a string s to obtain a string s'. These Edit Distances have many 
extensions when additional operators are used, such as Moves of substrings, 
Permutations and Cut/Paste operations. In this paper two distances are used 
on unranked ordered trees: the Tree Edit distance and the Tree Edit distance 
with moves, where we also allow moving an entire subtree in one step. 

Correctors are related to the self-testers of [3,4] in the theory of program 
verification. The related notion of property testing was proposed in [11], its 
applications to graphs were studied in [8] and its applications to regular words in 
[1]. These last authors prove that regular properties of words have testers, i.e. 
we can decide by sampling a word in constant time if it belongs to a regular 
language or if it is far from the language, i.e. at distance more than ε · n if n is 
the length of the string. An XML tester for a DTD and some ε takes an XML 
file F and decides with high probability in constant time if F is valid or if F is 
ε-far from the DTD. In [9], we prove the existence of such a tester for regular 
trees and the Tree Edit Distance with moves. 

A corrector for regular words transforms a word s close to a regular language, 
i.e. at a distance less than ε · n, into a word s' in the language such that s' is close 
to s. For regular tree properties, i.e. properties defined by tree automata, or by 
DTDs, such correctors have been introduced in [9] for the Tree-Edit distance 
with moves. A basic corrector was introduced in [7] for the classical Tree-Edit 
distance. It is a linear algorithm which takes a general invalid file F and a DTD 
M as input and produces a valid file F' close to F as output, when the distance 
between F and the DTD is constant, i.e. if there are not too many errors. It used 
the notion of a global correction which implied a time linear in n but exponential 
in the number of errors. 

In this paper, we first emphasize a fundamental difference between the two 
Tree Edit distances. If we allow moves, we prove that the distance problem is 
NP-complete and non-approximable. Yet we can approximate it very efficiently 
in the sense introduced by testers and correctors: we can quickly decide if it is 
close or far and we can approximate it when it is close. Ordered trees with 
moves behave as unordered trees and this fundamental difference becomes very 
important for XML schemas where operators such as interleave are restricted in 
order to allow for an efficient implementation. 

We give a global presentation of correctors and describe the implementation 
of a local corrector for the classical Tree Edit distance (without moves), i.e. an 
algorithm which is linear both in n and in the number of errors but may not yield 
the best correction. We describe its implementation using the Xerces parser and 
give test data for invalid files which follow three DTDs: the first represents deep 
trees, the second wide trees, and the third a combination of deep and wide trees. 
The main result of the paper is to show that the correction time is always less 
than the parsing time and therefore scales up. 

We discuss the generalization of a corrector to close DTDs. We consider a 
file $(F, M_1)$ where $M_1$ is the DTD of F, and an external DTD $M_2$ close to 
$M_1$, having defined the distance between two DTDs. We estimate the distance 
between $(F, M_1)$ and $M_2$, when F is close to $M_1$, $M_2$ is close to $M_1$ and therefore 
F is close to $M_2$. 

In section 2, we present the notations, the Tree Edit distances, property 
testing on trees, its connection with the correctors and prove that the distance 
problem for ordered trees with the Tree Edit distance with moves is NP-complete. 
In section 3, we describe the structure of correctors, the implementation of a local 
corrector and give an example. In section 4, we provide general figures on the 
relative parsing and correction time for three DTDs. In section 5, we introduce 
a distance on DTDs and generalize the approach to homomorphic DTDs. 



2 Preliminaries 

A well-formed XML file is composed of two parts: a ranked ordered labelled tree 
and a Document Type Definition (DTD) or Schema. We define these two objects, 
the Tree Edit Distance, which gives a measure between ordered labelled trees and 
between a tree and a DTD, and the basic notions of testers and correctors. 



2.1 Ranked Ordered Labelled Trees 

A ranked labelled ordered tree is a ranked ordered tree with labels, i.e. a structure 
$T_n = (D_n, Child_i, Label_j, root)_{i \le m,\, j \le p}$ where the domain $D_n = \{1, \ldots, n\}$ is the 
set of nodes, the binary relation $Child_i(u, v)$ is satisfied if $u$ is the $i$-th child of 
$v$ for $i \le m$ and a fixed $m$, and $root$ is a distinguished element of $D_n$ with no 
predecessors. The graph of the $Child_i(u, v)$ is a tree, $Label_j$ is a unary relation 
on $D_n$ and the set $\{Label_j\}_{j \le p}$ is a partition of $D_n$. 

Let $\mathcal{L} = \{l_1, l_2, \ldots, l_j, \ldots, l_p\}$ be the set of labels, also called tags. A DTD defines 
one label as a root and declares a set of rules l : (m) where l is a tag and m 
a regular expression on $\mathcal{L}$, also called the content model m. The content model 
(#PCDATA) indicates a leaf. Each rule l : (m) specifies a transition of a special 
unranked tree-automaton. In this paper we consider bottom-up tree automata 
on labelled trees, but we could have taken equivalent models. 
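
To make the reading of rules as automaton transitions concrete, here is a minimal Python sketch of a bottom-up validity check. It is an illustration under stated assumptions, not the Xerces-based implementation: the three rules are a hypothetical DTD, and each content model is written by hand as a Python regular expression over space-terminated child tags.

```python
import re

# Hypothetical DTD rules: tag -> content model as a regex over
# space-terminated child tags. None stands for (#PCDATA), i.e. a leaf.
RULES = {
    "book": r"(title )(author )+",
    "title": None,
    "author": None,
}

def is_valid(node):
    """node is a (tag, children) pair; every node must satisfy its rule."""
    tag, children = node
    if tag not in RULES:
        return False
    # Bottom-up: the subtrees must be valid before the node itself is checked.
    if not all(is_valid(child) for child in children):
        return False
    model = RULES[tag]
    if model is None:
        return not children
    child_tags = "".join(child[0] + " " for child in children)
    return re.fullmatch(model, child_tags) is not None

good = ("book", [("title", []), ("author", []), ("author", [])])
bad = ("book", [("author", []), ("title", [])])
print(is_valid(good), is_valid(bad))
```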

Our implementation uses the standard parsing tools, SAX and Xerces. After 
reading the file, we obtain a DOM tree, which is parsed bottom-up. We describe 
how to handle parsing errors, in order to modify a DOM tree and obtain a valid 
one. We interleave parsing steps with correction steps and obtain two outputs: 
a corrected file and an estimation of the Tree Edit Distance. It is possible that a 
SAX-based corrector would directly give an estimated distance without building 
a corrected file and hence avoid the DOM construction. 

2.2 The Tree Edit Distance 

The Edit Distance on strings was introduced in [14] for comparing strings 
and generalized in [12] for trees. The survey [2] describes hardness results for 
unordered trees and classical polynomial algorithms for ordered trees. Basic 
operations on ordered labelled trees include: change of a label, insertion of a node on 
a given edge (transformed into two edges), insertion of an edge or deletion of an 
edge. 

Definition 1 The distance between T and T' is the minimum number of Tree 
Edit operations necessary to reach T' from T, noted Dist(T, T'). The distance 
between T and a language L, noted Dist(T, L), is the minimum of 
Dist(T, T') for T' ∈ L. 

We say that two trees T and T' are k-close if their distance is less than k, 
and that T is ε-close to a DTD if the distance between T and the language L 
associated with the DTD is less than ε · n, if T has n nodes. T is ε-far from a DTD 
if it is not ε-close. It is a classical observation that the distance is P-computable 
for ordered trees and NP-complete on unordered trees. In the context of XML, 
several papers [10,5] estimate similarities between files using variations of the P- 
algorithm based on dynamic programming, an O(n^) algorithm. The methods 
we introduce can decide if the distance to a DTD is small or large in constant 
time, depending on ε but independent of n, when we use a tester. A corrector 
produces a corrected file in linear time. 
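
For intuition, the notions of distance to a language and ε-closeness can be sketched on strings, where the classical dynamic programming algorithm is short. This Python sketch is only a string analogue of Definition 1: it assumes a finite language given as an explicit set of words, which is not how a DTD defines its language, and the function names are hypothetical.

```python
def edit_distance(s, t):
    """Classical Edit Distance: insertions, deletions and modifications."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # modify a into b
        prev = curr
    return prev[-1]

def dist_to_language(s, language):
    """Dist(s, L): minimum distance from s to a word of the finite set L."""
    return min(edit_distance(s, w) for w in language)

def is_eps_close(s, language, eps):
    """s is eps-close to L if Dist(s, L) < eps * n, where n = len(s)."""
    return dist_to_language(s, language) < eps * len(s)

language = {"aaaa", "abba"}
print(dist_to_language("abab", language), is_eps_close("abab", language, 0.75))
```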

Distances with moves. On strings, a move is simply the possibility to isolate 
an arbitrary substring t and a position i in a string s of size n and to move t to 
position i, shifting the word to the right, in one step. The new Edit distance, called 
Edit distance with moves and introduced in [6] for strings, has many applications in 
the database streaming model. In the case of trees, a move is the possibility to 
isolate a subtree t and a leaf i in a tree T of size n. We move t to position i in 
one step, as shown in the figure below. 
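
On strings, a move can be written in a few lines. The Python sketch below is one possible reading of the operation, assuming that the target position is counted in the string that remains once the substring has been cut out; the function name and its parameters are hypothetical.

```python
def move(s, start, length, i):
    """Cut s[start:start+length] and re-insert it at index i of the remainder."""
    block = s[start:start + length]
    rest = s[:start] + s[start + length:]
    return rest[:i] + block + rest[i:]

print(move("abcdef", 3, 2, 0))
```

Here the substring "de" is moved to the front in a single step, whereas the classical Edit Distance would need more than one character operation to obtain the same string.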




Fig. 1. Tree Edit Distance with moves. 



This construction is indeed feasible in the DOM tree model by modifying 
two edges. The distance problem changes completely as the next result shows 
because ordered trees with moves behave as unordered trees. 

Proposition 1 The distance problem on ordered trees with the Edit distance 
with moves is NP-complete. 

Proof. We reduce the NP-complete problem 3SM (3-Set Matching, or Exact Cover 
by 3-Sets) to the distance problem on unranked trees. We use unranked trees 
similar to the ones mentioned in [2], where it is shown that the distance problem on 
unordered trees is NP-complete. Consider a set $U = \{u_1, u_2, \ldots, u_{3k}\}$ and $n$ sets 
of 3 elements of $U$, i.e. $S_1 = \{u_{1_1}, u_{1_2}, u_{1_3}\}, \ldots, S_i = \{u_{i_1}, u_{i_2}, u_{i_3}\}, \ldots, 
S_n = \{u_{n_1}, u_{n_2}, u_{n_3}\}$. We have to decide if there are $k$ of these sets which 
partition $U$, i.e. which are pairwise disjoint and whose union is $U$. 

Consider the two unranked trees below, $T_1$ and $T_2$, on an alphabet 
$\Sigma = \{u_1, u_2, \ldots, u_{3k}, S_1, \ldots, S_n, a, t\}$. The tree $T_1$ is built from the $n$ sets 
$S_1, \ldots, S_i, \ldots, S_n$ as in Figure 2, 
and the distance between the node $S_i$ and the three leaves is $n + 5k$. The tree 
$T_2$ has $u_1, \ldots, u_{3k}$ as successors and a node labelled $t$ with $3(n - k)$ leaves at 
distance $n + 5k$ labelled $t$. 

If we can partition $U$ with $k$ such sets, we can use $k$ deletions of edges $(a, S_i)$ 
followed by $k$ moves to place the $u_{i_1}, u_{i_2}, u_{i_3}$ to the left. We reorder all the $u_{i_j}$ 
and use up to $3k$ moves to obtain the left part of $T_2$. We then use $n - k$ merges 
on the other sets to obtain the right part of $T_2$ and $3(n - k) + 1$ modifications of 
labels after we change all the labels to $t$. In the worst case we make 
$2k + 3k + (n - k) + 3(n - k) + 1 = 4n + k + 1 = p$ operations. 

If the instance of 3SM is positive, the distance between $T_1$ and $T_2$ is at most 
$p$. If the instance of 3SM is negative, let us show that the distance is greater 
than $p$. The transformation of the second part of the tree requires $3(n - k) + 1$ 
operations. If we select any subset of size $k$, there will be at least one duplicate 
$u_i$ which requires $n + 5k$ operations to remove, and in the best case we apply 
$k + n + 5k$ deletions. The distance is at least $4n + 3k + 1$. This proves that the 
distance problem on unranked trees with an infinite alphabet is NP-complete. 

To reduce the problem to a finite alphabet, we replace the nodes labelled by a 
letter of $\{u_1, u_2, \ldots, u_{3k}\}$ by a binary tree of size $\log(3k)$ and suppress the 
labels $S_1, \ldots, S_n$. Construct larger trees $T'_1$ and $T'_2$ from $T_1$ and $T_2$ by replacing 
the nodes labelled $u_{i_1}, u_{i_2}, u_{i_3}$ by such finite trees. We keep the same construction, 
but instead of $3(n - k) + 1$ modifications of labels, we first need to apply 
$3(n - k) \log(3k)$ deletions of edges to remove the finite binary trees, and then apply 
the $3(n - k) + 1$ modifications of labels. We just take 
$p' = 2k + 3k + (n - k) + (3(n - k) + 1) \log(3k) = (3(n - k) + 1) \log(3k) + n + 4k$. 
This concludes that the distance problem on unranked trees is NP-complete. If we code 
the trees $T'_1$ and $T'_2$ by binary trees $T''_1$ and $T''_2$ using a standard encoding, we notice that if 
$Dist(T'_1, T'_2) = k$ then [...] we reduce the distance problem on unranked 
trees to the distance problem on ranked trees. □ 

The argument can also be used to show that the distance problem is not 
approximable, as we can maintain an arbitrarily large gap between positive and 
negative instances. Although this distance is hard to compute, it becomes easy 
to approximate in the sense introduced by testers. 

Fig. 2. Trees $T_1$ and $T_2$. 

2.3 Testers and Correctors 

In the late eighties, the theory of program checking and self-testing/correcting 
was initiated in [3,4]. Correctors in this context take a program, incorrect on a 
few instances, as an oracle and produce a correct program with high probability. 
Many interesting correctors for numerical computations and linear algebra are 
presented in [3]. 

The related notion of property testing was first proposed in [11]: it is an 
approximate randomized test which separates with high probability inputs which 
satisfy a property from those which are far from the property, using the Hamming 
distance to compare structures. For graphs, [8] investigated property testing for 
several properties such as c-colorability, which can be approximately tested in 
constant time. In [1], the authors prove that any regular property of strings can 
be tested, and this result is generalized for the Edit Distance in [9]. For trees 
with the Tree Edit distance with moves, [9] provides a randomized algorithm A 
which, given a tree T of size n and a regular tree language L defined by a DTD, 
is such that: A accepts if T ∈ L, and A rejects with high probability if T is ε-far 
from L, i.e. the distance between T and L is greater than ε · n. The time of the 
algorithm is independent of n and depends on ε only. Such an algorithm samples 
the DOM tree by selecting a random node and a finite local subtree of size 1/ε, 
and finally checks a local property. 

There is a corrector which takes a tree T that is ε-close to L, i.e. at distance less 
than ε · n, as input and produces a tree T' ∈ L, which is ε'-close to T, as output. 

Notice that the Tree Edit distance with moves is hard to compute but we 
can estimate it if it is small. First use the tester to detect if it is large. If it is small, 
use a corrector for an estimate. 
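
The idea of a tester whose running time depends on ε only can be illustrated on a deliberately simple toy property. The Python sketch below is not the tree tester of [9]: it tests the string property "every character is a" under the Hamming distance, and the sample count of 2/ε is an assumption chosen for the illustration.

```python
import math
import random

def sample_tester(s, eps, rng=random):
    """One-sided tester for the toy property 'every character of s is a'.

    It always accepts a string that has the property. If more than
    eps * len(s) characters differ from 'a', each sample finds one of them
    with probability greater than eps, so the tester rejects with high
    probability. The number of samples does not depend on len(s).
    """
    if not s:
        return True
    samples = math.ceil(2 / eps)
    return all(s[rng.randrange(len(s))] == "a" for _ in range(samples))

print(sample_tester("a" * 1000, 0.1), sample_tester("b" * 1000, 0.1))
```

A string that is neither in the language nor ε-far from it may be accepted or rejected, which is the gap that a corrector is then used to resolve.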



3 Correctors for XML 

A Corrector takes an XML file (which may not be valid) and the corresponding 
DTD as input and produces a valid XML file close to the original one and the 
approximate Edit distance as output. For the Tree Edit distance without moves, 
we do not know of correctors which would correct up to distances ε · n, but we know 
correctors which correct up to a constant distance. We now present two classes 
of correctors and an implementation for this distance. 

3.1 Corrector for the Tree Edit Distance 

There are two distinct steps proposed in [7], after having read a file. 

1. Inductive *-Marking of the DOM tree. Follow a bottom-up run 
on the DOM tree and mark with a * the nodes where a parsing error occurs. 

The root of a subtree which contains *-nodes is a *-node if all substitutions 
of * with feasible tags in the subtree lead to a parsing error. 

2. Recursive correction of the *-subtrees. Proceed top-down and 
propose local modifications in the neighborhood of the * node of maximum 
height. Compute the global distance obtained when the leaves are reached 
and choose the modification of minimum distance. 

This algorithm guarantees that the obtained DOM tree is valid and at a 
predictable distance from the original tree if the original file was at distance k from 
the DTD. Notice that the number of * nodes is a first estimate of the distance. The 
algorithm runs in time O(n) if n is the size of the tree, but exponential in the distance k. We therefore 
consider Local Correctors, where a local correction is made at each * node, in 
order to remove this exponential constant factor. 

3.2 Local Correctors 

An l-local Corrector makes corrections at each * node, looking at a finite 
neighborhood at distance l below the node. Clearly local corrections may not be as 
good as a global correction, but they are more efficient as they remove the 
exponential constant factor $2^k$. We propose a 1-local Corrector, or Local Corrector, 
i.e. we look at the direct successors of a * node, or neighborhood of size 1 below the 
node. 

A local correction at a * node proceeds as follows. We propose a new label 
t for a * node which guarantees that the tree T will be valid. We consider the 
string s of labels of the successors of the * node, select a rule of the DTD which 
minimizes the Tree Edit Distance and apply a local correction. 

Local Corrector 
Input: a file F and a DTD. 

Output: a valid file F' close to F. 

1. Inductive *-Marking of a tree. Follow a bottom-up run on the 
DOM tree and mark with a * the nodes where a parsing error occurs. 

2. Iterate Top-down at each * node. Propose a valid tag, i.e. a tag which 
leads to a valid tree, to a * node, apply a local correction at each node. 

Local Correction 

Input: a * node and its new label t, the string s of labels of successor nodes 
and a DTD. 

Output: the closest string s' of the DTD, the Edit distance and a sequence 
of local corrections at the * node. 

1. Consider all rules of type t : m where m is a content model, i.e. a 
regular expression over the language of tags. 

2. Compute the Edit distance between m and s. Select the rule with the 
minimum distance and closest string s'. Make the local correction, i.e. the 
sequence of Deletions, Insertions and Modifications applied to s to obtain s'. 

The key function is to compute the distance between a string and a regular 
expression, and there are efficient algorithms for this problem as in [15]. We then 
obtain a distance and a corrected string s' where nodes are added, removed or 
modified. In order to solve the distance to a regular expression problem very 
efficiently, we assume the DTD applies the * and + operators only to a single tag, 
and we say that such a DTD is in Unary Normal Form. If we computed the distance 
between s and a regular expression for the Edit distance with Moves, we would 
generalize the local corrector to the Tree Edit distance with Moves. 
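
For a content model in Unary Normal Form, the distance between s and the model can be computed with a small dynamic program. The sketch below is one possible formulation and not the algorithm of [15]. The names `dist_to_unf` and `closest_model` are hypothetical, and so is the cost model: insertions cost 1, while deleting or relabeling a successor costs an optional weight, for instance its subtree size.

```python
def dist_to_unf(s, model, sizes=None):
    """Edit distance between a tag list s and one Unary Normal Form model.

    model: list of (tag, op) pairs with op in {"", "*", "+"}.
    sizes: assumed cost of deleting or relabeling each symbol of s
           (for instance its subtree size); every insertion costs 1.
    """
    n, m = len(s), len(model)
    w = list(sizes) if sizes is not None else [1] * n
    inf = float("inf")
    # D[i][j]: cheapest way to turn s[:i] into a word of model[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for j in range(m + 1):
        for i in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0:
                best = D[i - 1][j] + w[i - 1]          # delete s[i-1]
            if j > 0:
                tag, op = model[j - 1]
                if op == "*":
                    best = min(best, D[i][j - 1])      # zero occurrences
                else:
                    best = min(best, D[i][j - 1] + 1)  # insert one occurrence
                if i > 0:
                    relabel = 0 if s[i - 1] == tag else w[i - 1]
                    if op != "*":                      # first occurrence
                        best = min(best, D[i - 1][j - 1] + relabel)
                    if op in ("*", "+"):               # one more occurrence
                        best = min(best, D[i - 1][j] + relabel)
            D[i][j] = best
    return D[n][m]

def closest_model(s, models, sizes=None):
    """Return (distance, model) for the alternative closest to s."""
    return min(((dist_to_unf(s, mdl, sizes), mdl) for mdl in models),
               key=lambda pair: pair[0])

# Alternatives of a sample rule r : (l,r) | q, one list per disjunct.
R_MODELS = [[("l", ""), ("r", "")], [("q", "")]]
```

With a single successor r whose subtree has size 11, `closest_model(["r"], R_MODELS, [11])` selects the alternative (l,r) at distance 1, whereas replacing the subtree by q would cost 11.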

3.3 Java Implementation Details 

An implementation of the corrector using the notion of local correction at every 
error node is available on the site http://www.lri.fr/~mdr/xml. It uses the SAX 
and Xerces parsers and its key features are: 

Phase 1: Parsing. As we read the file, we read the DTD with the SAX parser, 
store the declaration of each element in a class and obtain the DOM structure of 
the document by parsing it with the Xerces-J parser. We traverse the DOM 
structure bottom-up, and assign Height and SubtreeSize to each node as attributes. 
Height will be used to set the preference for correction whereas SubtreeSize will 
be used to calculate the approximate edit distance. We isolate the error nodes 
(* nodes) along with the error statement, as we consider several parsing 
possibilities. We store these nodes in an array. If there is any fatal error (i.e., the 
given XML document is not well formed) then we exit; otherwise we go to the 
Correction phase. 
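
The bottom-up assignment of Height and SubtreeSize can be sketched in a few lines. The implementation described here is in Java and stores both values as DOM attributes. The Python sketch below is only an illustration: it keeps them in a side table instead, and both the function name `annotate` and the convention that a leaf has height 0 are assumptions.

```python
import xml.etree.ElementTree as ET

def annotate(node, info=None):
    """Bottom-up pass recording (height, subtree_size) for every element.

    Assumed conventions: a leaf has height 0 and subtree size 1.
    """
    if info is None:
        info = {}
    height, size = 0, 1
    for child in node:
        annotate(child, info)
        child_height, child_size = info[child]
        height = max(height, child_height + 1)
        size += child_size
    info[node] = (height, size)
    return info

doc = ET.fromstring("<a><l/><r><l/><r><q/></r></r></a>")
info = annotate(doc)
```

Keeping the values outside the tree avoids the later clean-up step in which the attributes added by the program have to be removed before serialization.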

Phase 2: Correction. For each * node, we derive possible tags for that node 
such that the root of the tree accepts. For each possible tag t, we access the 
DNF of its Content Model, the minimum height of its subtrees (to compute the 
Edit distance while inserting or removing nodes) and the string s of tags of the 
successors. For each content model m, compute the Edit distance between s and 
m. Choose the model m with the minimum Edit distance. We obtain the string 
s' closest to s and a sequence of basic operations among M (Match), D (Delete), 
I (Insert), C (Change) which have to be realized for each letter of s. Modify the 
tree following these modifications. Check the attributes of each child node. If 
some attribute is missing then add it, if there is an undeclared attribute then 
remove it and if the attribute is not correctly defined then modify it. 

Compute the total Edit distance, for all the corrections done. Remove the 
attributes added as part of this program. Serialize this DOM structure to an 
XML Document. 



3.4 Example of a Corrected File 

Consider the following DTD which defines a class of right-branch trees. 
<?xml version="1.0"?> 
<!DOCTYPE a [ 
<!ELEMENT a (l,r)> 
<!ELEMENT r ((l,r)|q)> 
<!ELEMENT l (#PCDATA)> 
<!ELEMENT q (#PCDATA)> 
]> 

Consider a file F whose DOM tree is represented in (a) of the following 
figure. At parsing time we generate two * nodes. Correcting top-down, the first 
* node is labelled r and the string s below at distance 1 is r. There are two 
possibilities for s' from the DTD: s'1 = l,r or s'2 = q. In the first case, we correct 
by introducing a left branch l and the Tree Edit Distance is 1. In the second 
case we drop the subtree labelled r and replace it by a branch labelled with q. 
The Tree Edit Distance is the size of the subtree, i.e. 11, and this choice is not 
taken. We proceed until the next * node labelled r, where s = l,r,q. In this case 
s' = l,r and we drop the q branch. The total Edit Distance is 2. 




4 Comparative Study 

We consider three different DTDs and analyze the parsing and the correction 
time of files up to 800 nodes. The first DTD defines deep binary trees, the second 
DTD defines wide trees and the third DTD defines a mixture of deep and wide 
trees. 



4.1 Deep Tree Example 

The first DTD defines the class of right-branch trees with the tags a,l,r,q, as 
in the example of section 3.4. Given a valid binary tree, we randomly add extra 
branches to a node r, remove some left branches and change some of the r tags. 
We assume a distance of 10, i.e. we add 10 errors to various files from 50 nodes to 
800 nodes. 

Figure 4 shows the approximate time in milliseconds taken for the parsing 
as well as the correction time of the XML document. It clearly indicates a 
correction time negligible compared to the parsing time. In this case, the possible 
expanded words w at each stage are very short and the Edit distance 
computations are very efficient. 

[Figure: parsing time and correction time of a deep tree; series: parsing time, correction time.] 

Fig. 4. Analysis of Deep trees. 



4.2 Wide Tree Example 

The second DTD is closer to classical examples in databases where the branching 
degree is large. We represent a document with a title and many lines, each one 
composed of many chars. We introduce 10 random errors in wide valid trees of 
size 50 up to 800 nodes. 

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE d [ 
<!ELEMENT d (title, line*)> 
<!ELEMENT line (b, char*, c)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT char (#PCDATA)> 
<!ELEMENT b (#PCDATA)> 
<!ELEMENT c (#PCDATA)> 
]> 




Fig. 5. Analysis of Wide trees. 



In this case, the discrepancy in the parsing and correction time is due to 
the large number of expansions of the regular expressions to be considered. 
This number increases drastically and we need to compute the edit distance for 
many probable corrections and then sort out the smallest one. Notice that the 
correction time is still less than the parsing time. 

4.3 Deep Tree Having Some Wide Branches 

We now define a mixture of the previous DTDs. We have many b nodes below 
which we attach a right-branch tree, as defined by the first DTD. Starting from 
valid trees of size 50 up to 800 nodes, we generate invalid trees with 10 random 
errors. 

<?xml version="1.0"?> 
<!DOCTYPE b [ 
<!ELEMENT b ((a|b),r)> 
<!ELEMENT a (l,r)> 
<!ELEMENT r ((l,r)|q)> 
<!ELEMENT l (#PCDATA)> 
<!ELEMENT q (#PCDATA)> 
]> 



[Figure: parsing time and correction time for a mixed example; horizontal axis: size of file.] 

Fig. 6. Wide branches of deep trees. 



In this last case, the correction time is about the average of the two previous 
examples and significantly less than the parsing time. 

If we had considered the global correction, the situation would have been 
quite different. We would need to check all the permutations of all possible 
corrections at each error node and then select the best correction. For wide 
trees, this would be acceptable. For deep trees, the global method might produce 
a better correction, but it would be very inefficient. 



4.4 The Book Example 

Consider the following standard realistic DTD describing a book document. On 
the site http://www.lri.fr/~mdr/xml, there are several examples of invalid files 
with few errors, and the corrected valid file is obtained on-line. In general, the 
user can upload their own file, provided the DTD is included and in Unary Normal 
Form. If the distance is not too large, they will obtain a corrected valid file on-line. 

<?xml version='1.0'?> 
<!ELEMENT book (chapter*, title, author)> 
<!ELEMENT chapter (title, para*)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT para (#PCDATA)> 
<!ELEMENT author (#PCDATA)> 

A more general problem concerns the correction of a file F, using a DTD $M_1$, 
with respect to a close DTD $M_2$. It is essential to rank Web documents relative 
to a given DTD. We have to consider not only the given DTD but also close 
ones, in particular to adjust to the different languages used on the Web. The 
following DTD is a close French version of the previous one. If we wish to rank 
documents of type Book with their Edit distance, we need to also consider French 
books, Chinese books and so on. 

<?xml version='1.0'?> 
<!ELEMENT livre (chapitre*, editeur, auteur, titre)> 
<!ELEMENT chapitre (titre, para*)> 
<!ELEMENT titre (#PCDATA)> 
<!ELEMENT para (#PCDATA)> 
<!ELEMENT auteur (#PCDATA)> 
<!ELEMENT editeur (#PCDATA)> 

We show that we can generalize the corrector to handle DTDs which are 
close to a given DTD. 

5 A Corrector for Homomorphic Data Type Definitions 

The correction algorithm can be generalized to DTDs which may not be the 
original DTD of the file, provided they are close. If a document F is a French book 
with a DTD close to the English one, we wish to find a mapping between French 
and English tags and an approximate distance between the French document and 
the English DTD. We generalize the distance between a tree and a DTD to a 
distance between two DTDs by defining a function of n, as the maximum of 
the distances between pairs of trees in $M_1$, $M_2$ of size n, making the definition 
symmetric. When they are close, i.e. the distance is $O(1)$, we propose to adapt 
the previous corrector. 



5.1 Distance on DTDs 

Consider two regular expressions or two DTDs. We first consider the case when 
they use the same language of tags $\mathcal{L}_1$ and then different languages 
$\mathcal{L}_1$ and $\mathcal{L}_2$. 

Definition 2 The distance between two regular expressions r and t on the same 
language $\mathcal{L}_1$ is the function 
$\mathrm{Dist}(n,r,t) = \max\{ \mathrm{Dist}_1(n,r,t), \mathrm{Dist}_1(n,t,r) \}$ 
where 

$$\mathrm{Dist}_1(n,r,t) = \max_{w \in r,\ |w| = n} \{ \mathrm{dist}(w,t) \}$$ 

Similarly, the distance between two DTDs $M_1$ and $M_2$ on the same language 
$\mathcal{L}_1$ is the function 
$\mathrm{Dist}(n,M_1,M_2) = \max\{ \mathrm{Dist}_1(n,M_1,M_2), \mathrm{Dist}_1(n,M_2,M_1) \}$ 
where 

$$\mathrm{Dist}_1(n,M_1,M_2) = \max_{F \in M_1,\ |F| = n} \{ \mathrm{dist}(F,M_2) \}$$ 

We say that two regular expressions or two DTDs are at distance $O(1)$, or at 
constant distance, if the corresponding function Dist is $O(1)$. We are interested 
in short distances, say up to $\log n$, as we wish to capture close DTDs. In the 
sequel we only consider DTDs at constant distances. 

Example 1. Let $\mathcal{L}_1 = \{a, b, c, d, u, v\}$. If $r = au^*bv^*cd$ and $t_1 = au^*abv^*c$ 
then $\mathrm{Dist}(n,r,t_1) = O(1)$. If $t_2 = f^*c$ then $\mathrm{Dist}(n,r,t_2) = O(n)$. 
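
For small n the function Dist can be evaluated by brute force, which gives a quick check of the first claim of the example. The sketch below is an illustration only. The helper names are hypothetical, expressions are encoded as lists of (tag, operator) pairs in Unary Normal Form, and dist(w,t) is obtained by comparing w with every word of t up to a length bound of 2|w| plus the length of the shortest word of t, which is enough to contain a closest word.

```python
def words(model, max_len):
    """All words (tuples of tags) of a Unary Normal Form expression
    whose length is at most max_len."""
    results = [()]
    for tag, op in model:
        low = 0 if op == "*" else 1
        extended = []
        for prefix in results:
            room = max_len - len(prefix)
            high = room if op in ("*", "+") else 1
            for k in range(low, high + 1):
                if k <= room:
                    extended.append(prefix + (tag,) * k)
        results = extended
    return results

def lev(x, y):
    """Classical Levenshtein distance between two sequences."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def dist_to_language(w, model):
    """dist(w, t): distance from w to the closest word of the expression."""
    min_len = sum(op != "*" for _, op in model)
    return min(lev(w, x) for x in words(model, 2 * len(w) + min_len))

def dist1(n, r, t):
    """Maximum over the words w of r with |w| = n of dist(w, t)."""
    return max(dist_to_language(w, t) for w in words(r, n) if len(w) == n)

def dist(n, r, t):
    """Symmetric distance Dist(n, r, t)."""
    return max(dist1(n, r, t), dist1(n, t, r))

R = [("a", ""), ("u", "*"), ("b", ""), ("v", "*"), ("c", ""), ("d", "")]
T1 = [("a", ""), ("u", "*"), ("a", ""), ("b", ""), ("v", "*"), ("c", "")]
values = [dist(n, R, T1) for n in range(4, 8)]
```

Both expressions have words only from length 4 on, so `dist1` is meaningful for n ≥ 4. The values stay constant on this range (one insertion of a and one deletion of d suffice), in line with a distance that does not grow with n.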

If regular expressions or DTDs use different languages $\mathcal{L}_1$ and $\mathcal{L}_2$, we wish to 
generalize this definition. We consider mappings $\pi$ from $\mathcal{L}_1 \cup \{\bot\}$ to 
$\mathcal{L}_2 \cup \{\bot\}$ such that $\pi(\bot) = \bot$. We then extend the previous definition to two 
regular expressions r and t by taking the minimum over all $\pi$ of the distance between 
$\pi(r)$ and t. We interpret $\bot$ as the empty symbol, i.e. $a.\bot = \bot.a = a$. 

Definition 3 Two regular expressions r and t are isomorphic if there exists a 
mapping $\pi$ between the symbols (tags) of r in $\mathcal{L}_1$ and the symbols (tags) of t in 
$\mathcal{L}_2$ such that $\pi(r) = t$. Two regular expressions r and t in Unary Normal Form 
such that $|r| \ge |t|$ are homomorphic if there exists a mapping $\pi$ between 
$\mathcal{L}_1 \cup \{\bot\}$ and $\mathcal{L}_2 \cup \{\bot\}$ such that the distance between $\pi(r)$ and t is $O(1)$. 

In case $|r| < |t|$, consider mappings from $\mathcal{L}_2$ into $\mathcal{L}_1$ and compare $\pi(t)$ with r. 

Example 2. Let $\mathcal{L}_1 = \{a, b, c, d, u, v\}$ and $\mathcal{L}_2 = \{a, b, c, d, u, v\}$. If $r = au^*bv^*cd$ 
and $t_1 = cx^*y^*e$ then $\mathrm{Dist}(n,r,t_1) = O(1)$ with $\pi$ such that $\pi(a) = c$, 
$\pi(u) = x$, $\pi(b) = \bot$, $\pi(v) = y$, $\pi(c) = e$ and $\pi(a) = c$. Hence $au^*bv^*cd$ and 
$cx^*y^*e$ are homomorphic. 

If $t_2 = cx^*y^*e^*$ then $\mathrm{Dist}(n,r,t_2) = O(n)$ for any mapping $\pi$ and these 
regular expressions are not homomorphic. 

We now generalize the previous definition to two DTDs on different languages, 
assuming they are reduced, i.e. each rule influences the tree language associated 
with the DTD. Consider that the DTD provides the DNF form for each tag. For 
a tag a, let $\mathrm{DNF}(a)$ be the regular expression which defines a. Suppose without 
loss of generality that $|\mathcal{L}_1| \ge |\mathcal{L}_2|$. Call a tag a recursive if either $a^*$ occurs in 
some content model, or a occurs in the DNF form of a, or a is on a loop in 
the dependency graph whose nodes are tags and whose edges link a tag b with all 
the tags in its DNF. 

Definition 4 Two DTDs $M_1$ and $M_2$ with roots $r_1$ and $r_2$ are homomorphic if 
there exists a mapping $\pi$ between $\mathcal{L}_1 \cup \{\bot\}$ and $\mathcal{L}_2 \cup \{\bot\}$ such that: 

- if a is recursive in $M_1$ then $\pi(a)$ is recursive in $M_2$ and $\pi(\mathrm{DNF}(a))$ is 
isomorphic to $\mathrm{DNF}(\pi(a))$, 

- if b is non recursive in $M_1$ then $\pi(\mathrm{DNF}(b))$ is homomorphic to $\mathrm{DNF}(\pi(b))$, 

- $\pi(r_1) = \pi(r_2)$. 

If $|\mathcal{L}_1| < |\mathcal{L}_2|$, we require $\pi$ to map $\mathcal{L}_2 \cup \{\bot\}$ into $\mathcal{L}_1 \cup \{\bot\}$ as before. 
Example 3. Let $\mathcal{L}_1 = \{book, chapter, title, author, para\}$ for the English DTD and 
$\mathcal{L}_2 = \{livre, chapitre, titre, auteur, para, editeur\}$ for the French DTD. In this 
case chapitre, para, titre are recursive in the French DTD $M_2$. 

Let $\pi$ be such that $\pi(livre) = book$, $\pi(chapitre) = chapter$, $\pi(titre) = title$, 
$\pi(auteur) = author$, $\pi(para) = para$, $\pi(editeur) = \bot$. 

In this case the recursive tags are mapped to isomorphic DNFs 
and the non recursive tag livre is mapped to a homomorphic DNF, as 
$\pi(chapitre^*, editeur, auteur, titre) = (chapter^*, author, title)$ is at a constant 
distance from $(chapter^*, title, author)$. 
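
The mapping of this example can be replayed mechanically. In the sketch below, which is only an illustration, content models are lists of (tag, operator) pairs, the empty symbol is represented by `None`, and the names `PI`, `apply_pi`, `FRENCH` and `ENGLISH` are hypothetical.

```python
BOTTOM = None  # stands for the empty symbol

# The mapping of the example, from French tags to English tags.
PI = {
    "livre": "book", "chapitre": "chapter", "titre": "title",
    "auteur": "author", "para": "para", "editeur": BOTTOM,
}

# Content models of the two book DTDs as (tag, operator) lists.
FRENCH = {
    "livre": [("chapitre", "*"), ("editeur", ""), ("auteur", ""), ("titre", "")],
    "chapitre": [("titre", ""), ("para", "*")],
}
ENGLISH = {
    "book": [("chapter", "*"), ("title", ""), ("author", "")],
    "chapter": [("title", ""), ("para", "*")],
}

def apply_pi(model, pi):
    """Rename every tag with pi and drop the tags mapped to the empty symbol."""
    return [(pi[tag], op) for tag, op in model if pi[tag] is not BOTTOM]

image_chapitre = apply_pi(FRENCH["chapitre"], PI)
image_livre = apply_pi(FRENCH["livre"], PI)

# The recursive tag chapitre is mapped onto exactly the English model.
same_chapter = image_chapitre == ENGLISH["chapter"]
# The image of livre has the same items as book, in a different order.
same_items = sorted(image_livre) == sorted(ENGLISH["book"])
```

The image of livre differs from the content model of book only by the order of title and author, which is a bounded number of edit operations whatever the number of chapters.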

Proposition 2 Two homomorphic regular expressions r, t are at distance $O(1)$. 
Two homomorphic DTDs are at distance $O(1)$. 

Given a file F with a DTD $M_1$, and another external DTD $M_2$, we wish to 
estimate the distance between F and $M_2$. We can look for all possible $\pi$ which 
are homomorphisms between $M_1$ and $M_2$, and look at the distance between 
$\pi(F)$ and $M_2$ with the previous corrector. This procedure would be exponential 
in the number of tags and inefficient in practice. We can however generalize the 
approach of a corrector: we build a possible $\pi$ bottom-up and correct top-down 
as before. 

5.2 Generalized Local Corrector 

A (d, l)-local Generalized Corrector looks bottom-up at finite neighborhoods at 
distance d above a node and constructs the $\pi$ which minimizes the edit distance. 
As we proceed bottom-up, we only extend a given $\pi$ and quit if the distance is 
too large. We apply the corrector top-down as before to estimate the distance 
between F and an external DTD $M_2$. 

The local construction of a $\pi$ is not as good as the exhaustive search for the 
best $\pi$ but is very efficient. We propose a (2, 1)-local Generalized Corrector: we 
look at depth 2 above as we proceed bottom-up and at depth 1 top-down as 
before. It has the advantage of quickly catching the expressions $a^*$ which have 
to be matched to a $b^*$ in the other DTD. We call such a (2, 1)-local Generalized 
Corrector GLC, for Generalized Local Corrector. 

GLC 

Input: a file $(F, M_1)$, an external DTD $M_2$ and a parameter k. 

Output: the approximate distance between F and $M_2$, when it is less than 
k, and a corrected file F'. 

1. Follow a bottom-up run on the DOM tree and inductively construct 
the best $\pi$. Mark with a * the nodes where a parsing error for $M_2$ occurs. 

2. Proceed top-down as in the classical Corrector. Propose a valid tag, 
i.e. a tag which leads to a valid tree, to a * node, and apply a local correction at 
each node. 

We build a potential mapping $\pi$ bottom-up and estimate the Edit Distance. 
If it is greater than k, we stop and consider another mapping. If we reach the root 
with a potential mapping $\pi$, we correct $\pi(F)$. The advantage of this construction 
is that very few possible $\pi$ are considered, especially when started with a random 
subtree of depth 2 from the leaves. We construct an approximate $\pi$ bottom-up 
and obtain with the Corrector an approximate distance top-down. 

6 Conclusion 

A corrector for XML documents can estimate in linear time the distance between 
a document and a DTD when this distance is small. It is a useful operation to 
rank documents relative to several DTDs. The proposed implementation of a 
local corrector uses the Xerces parser and shows that in general the correction 
time is less than the parsing time and therefore scales up. It only corrects 
documents for a fixed constant distance but we intend to generalize the corrector for 
the Tree Edit distance with moves. In this case we would correct for a distance 
up to $\epsilon \cdot n$, as predicted by the theory. 

We proposed the generalization of a corrector to external DTDs, i.e. to handle 
a file $(F, M_1)$ whose DTD is $M_1$ with an external DTD $M_2$, provided the distance 
between $M_1$ and $M_2$ is constant. At the heart of the problem lies the approximate 
equivalence of regular expressions. 

Abstract. We introduce a method for building an XML constraint validator from 
a given set of schema, key and foreign key constraints. The XML constraint 
validator obtained by our method is a bottom-up tree transducer that is used not only 
for checking, in only one pass, the correctness of an XML document but also for 
incrementally validating updates over this document. In this way, both the 
verification from scratch and the update verification are based on regular (finite and 
tree) automata, making the whole process efficient. 



1 Introduction 

We address the problem of incremental validation of updates performed on an XML 
document that respects a set of schema and integrity constraints (i.e., on a valid XML 
document). Given a set of schema and integrity constraints $\mathcal{D}$, we present a method 
that translates $\mathcal{D}$ into a bottom-up tree transducer $\mathcal{U}$ capable of verifying the validity 
of the document. We only address meaningful specifications [11], i.e., ones in which 
integrity constraints are consistent with respect to the schema. The aim of this work 
is the construction of a transducer $\mathcal{U}$ that allows incremental validation of updates. In 
this paper, we deal mostly with the verification of key and foreign key constraints. The 
validation of updates taking into account schema constraints (DTD) is performed by 
$\mathcal{U}$ exactly as proposed in [5]. Our framework takes into account attributes as well as 
elements: details concerning the treatment of attributes are presented in [5,6]. Here, for 
the sake of simplicity, we disregard the specificity of attributes. The main contributions of 
the paper are: 

• A method for generating a validator from a given specification containing schema, 
key and foreign key constraints. 

• An unranked bottom-up tree transducer, which represents the validator, where syn- 
tactic and semantic aspects are well separated. 

• An incremental schema, key and foreign key validation method. 

• An index tree that allows incremental updates on XML documents. This key index 
can also be used to evaluate queries efficiently. 

* Supported by CAPES (Brazil) BEX0706/02-7 

** This work was done while the author was on leave at Université François Rabelais. Supported 

This paper is organized as follows: section 2 gives an overview of the incremental 
constraint checking framework. Section 3 presents our method to build a tree transducer 
from a given specification containing a DTD and a set of keys and foreign keys. We 
also show how the transducer is used to efficiently verify all the imposed constraints. 
Section 4 shows how incremental validation is performed on updates. Section 5 concludes 
and describes our future research directions. 



2 General Overview 

An XML document is a structure T composed of an unranked labeled tree t and functions 
type and value. The tree t is a mapping $t : dom(t) \to \Sigma$ where dom(t), called the set of t's 
positions, is a set of finite strings of positive integers closed under prefixes (see Fig. 1). 
We write t(p) = a for $p \in dom(t)$ to indicate that the symbol a is the label in $\Sigma$ associated 
with the node at position p. The function type(t, p) indicates the type (element, attribute 
or data) of the node at position p. The function value(t, p) gives the value associated 
with a data node. 
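
This representation is easy to mirror in code. The sketch below is an illustration under assumptions: positions are written as strings of single-digit child indexes, the fragment of the tree and its root label `guide` are hypothetical, and the helper names are not from the paper.

```python
# A fragment of a labeled tree as a mapping from positions to labels.
# The root has the empty position; the root label "guide" is hypothetical.
t = {
    "": "guide",
    "0": "restaurant",
    "02": "menu",
    "020": "drinks",
    "0200": "wine",
    "02000": "name",
    "020000": "data",
    "02001": "year",
    "020010": "data",
}

# The function value, defined on data nodes only.
value = {"020000": "Sancerre", "020010": "2000"}

def is_prefix_closed(dom):
    """True if every proper prefix of every position is itself a position."""
    return all(p[:k] in dom for p in dom for k in range(len(p)))

def children(tree, p):
    """Positions of the children of p, in document order."""
    return sorted(q for q in tree if len(q) == len(p) + 1 and q.startswith(p))

wine_children = children(t, "0200")
```

Writing positions as strings keeps the prefix test trivial; a tree with more than ten children per node would need tuples of integers instead.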

Fig. 1 shows part of the labeled tree representing the document used in our examples. 
It describes menus and combinations in some French restaurants. Unlike the à 
la carte style, a combination is a grouping of dishes and drinks, reducing both the choice 
and the price for clients. Each node in the tree has a position and a label. Elements 
and attributes associated with arbitrary text have a child labeled data. In Fig. 1 attribute 
labels are depicted with a preceding @. 




[Figure: labeled tree; data values at the leaves include Sancerre, 2000, 21.00, Cahors, 2002, 25.00.] 



Fig. 1. Labeled tree t of an XML document 



Definition 1. Key and foreign key syntax [8]: A key is represented by 
$(P, (P', \{P^1, \dots, P^m\}))$. A foreign key is represented by 
$(P_0, (P'_0, \{P_0^1, \dots, P_0^m\})) \subseteq K$ where 
$K = (P, (P', \{P^1, \dots, P^m\}))$ is a key such that $P = P_0$. In a key, path P is called the 
context path, P' the target path and $P^1, \dots, P^m$ the key paths. The same applies for a 
foreign key, except for $P_0^1, \dots, P_0^m$ that are called foreign key paths. □ 

All the paths in the definition above use only the child axes. Context and target paths 
should reach element nodes. Key (or foreign key) paths are required to end at a node 
associated to a value, i.e., attribute nodes or elements having just one child of type data. 
The next example gives the intuition of the semantics of key and foreign key constraints 
over the document of Fig. 1. 

Example 1. Let $K_1$ = (/restaurant, (./menu/drinks/wine, {./name, ./year})) be a key 
constraint indicating that, in the context of a restaurant, a wine (the target node) can 
be uniquely identified by its name and its year. Let $FK_2$ = (/restaurant, 
(./combinations/combination, {./wineName, ./wineYear})) $\subseteq K_1$ be a foreign key constraint 
indicating that, for each restaurant, a combination is composed of a wine that should appear 
in the menu of the restaurant. □ 

Definition 2. Key and foreign key semantics: An XML tree T satisfies a key 
$(P, (P', \{P^1, \dots, P^m\}))$ if for each context position p defined by P the following two 
conditions hold: (i) For each target position p' reachable from p via P' there exists a 
unique position $p_h$ reachable from p' via $P^h$, for each $1 \le h \le m$. (ii) For any target 
positions p' and p'', reachable from p via P', whenever the values reached from p' and p'' 
via $P^h$ ($1 \le h \le m$) are equal, then p' and p'' must be the same position. Similarly, an 
XML tree T satisfies a foreign key $(P_0, (P'_0, \{P_0^1, \dots, P_0^m\})) \subseteq K$ if: (i) it satisfies 
its associated key K and (ii) each tuple $\tau$ of values, built following paths 
$P_0/P'_0/P_0^1, \dots, P_0/P'_0/P_0^m$ (in this order), can also be obtained by following the 
paths $P/P'/P^1, \dots, P/P'/P^m$ (in this order). □ 
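
The two definitions can be checked directly on a small document. The sketch below is a naive, non-incremental check and not the transducer of this paper. It uses the child-only path subset of Python's `xml.etree.ElementTree`, it compares foreign key tuples with the key tuples of the same context, and it assumes a hypothetical root element `guide` so that the context path is written `./restaurant`.

```python
import xml.etree.ElementTree as ET

def tuples_at(context, target_path, key_paths):
    """Key tuples of all targets below one context node, or None when
    some key path does not reach exactly one node."""
    result = []
    for target in context.findall(target_path):
        values = []
        for path in key_paths:
            hits = target.findall(path)
            if len(hits) != 1:
                return None
            values.append(hits[0].text)
        result.append(tuple(values))
    return result

def satisfies_key(root, context_path, target_path, key_paths):
    for context in root.findall(context_path):
        found = tuples_at(context, target_path, key_paths)
        if found is None or len(set(found)) != len(found):
            return False
    return True

def satisfies_foreign_key(root, context_path, fk, key):
    (fk_target, fk_paths), (key_target, key_paths) = fk, key
    if not satisfies_key(root, context_path, key_target, key_paths):
        return False
    for context in root.findall(context_path):
        keys = set(tuples_at(context, key_target, key_paths))
        refs = tuples_at(context, fk_target, fk_paths)
        if refs is None or any(ref not in keys for ref in refs):
            return False
    return True

# Hypothetical document in the spirit of Fig. 1.
doc = ET.fromstring(
    "<guide><restaurant>"
    "<menu><drinks>"
    "<wine><name>Sancerre</name><year>2000</year></wine>"
    "<wine><name>Cahors</name><year>2002</year></wine>"
    "</drinks></menu>"
    "<combinations><combination>"
    "<wineName>Sancerre</wineName><wineYear>2000</wineYear>"
    "</combination></combinations>"
    "</restaurant></guide>"
)
K1 = ("./menu/drinks/wine", ["./name", "./year"])
FK2 = ("./combinations/combination", ["./wineName", "./wineYear"])
key_ok = satisfies_key(doc, "./restaurant", *K1)
fk_ok = satisfies_foreign_key(doc, "./restaurant", FK2, K1)
```

This check revisits the whole document on every call, which is exactly the cost that a one-pass transducer with incremental tests is meant to avoid.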

In the following, we assume an XML tree T and a set of schema and integrity 
constraints $\mathcal{D}$ and we survey (i) the validation of T from scratch, which is performed in 
only one pass on the XML tree, and (ii) the incremental validation of updates over T. 

2.1 Validation from Scratch 

Our method consists in building a tree transducer capable of expressing all the constraints 
of a given specification $\mathcal{D}$. The tree transducer is composed of a bottom-up tree automaton 
(to verify the syntactic restrictions) and a set of actions defined for each key and foreign 
key. These actions manipulate values and are used to verify the semantic aspects of 
constraints. The execution of the tree transducer consists in visiting the tree in a 
bottom-up manner*, performing, at each node: 

A) The verification of schema constraints. Schema constraints are satisfied if all posi- 
tions of a tree t can be associated to a state and if the root is bound to a final state 
(defined by the specification). A state q is assigned to a position p if the children of 
p in t verify the element and attribute constraints established by the specification. 
Roughly, a schema constraint establishes, for a position labeled a, the type, the num- 
ber and (for the sub-elements) the order of p’s children. We assume that the XML 
document in Fig. 1 is valid wrt schema constraints (see [5] for details). 

* Notice that it is very easy to perform a bottom-up visit even using SAX [14](with a stack). 

B) The verification of key and foreign key constraints. In order to validate key and 
foreign key constraints we need to manipulate data values. To this end, we define 
the values to be carried up from children to parents in an XML tree. The following 
example illustrates how the transducer treats values being carried up for each node. 
This treatment depends on the role of the node's label in the key or foreign key. 



Example 2. We assume a tree transducer obtained from specification $\mathcal{D}$ (containing a 
given DTD together with $K_1$, $FK_2$ of Example 1) and we analyze its execution over T 
(Fig. 1): 

1. The tree transducer computes the values associated to all nodes labeled data. 
We consider value(020000) = value(03000) = Sancerre and value(020010) = 
value(03001) = 2000 as some of the values computed in this step. 

2. The tree transducer analyzes the parents of the data nodes. If they are key or foreign 
key nodes, they receive the values computed in step 1 . Otherwise, no value is carried 
up. In our case, the value Sancerre is passed to key node 02000 and to foreign key 
node 0300. The value 2000 is passed to key node 02001 and to foreign key node 
0301. 

3. The tree transducer passes the values from children to parent until it finds a target 
node. At this level the values for each key or foreign key are grouped in a list. Node 
0200 is a target for $K_1$, and as the key is composed of two items, the list contains 
the tuple value (Sancerre, 2000). Similarly, node 030 (target node for $FK_2$) is 
associated to (Sancerre, 2000). 

4. The transducer carries up the lists of values obtained in step 3 until finding a context 
node. At a context node of a key, the transducer tests if all the lists are distinct, 
returning a boolean value. Similarly, at a context of a foreign key, the transducer 
tests if all the tuples exist as values of the referenced key. In our case, restaurant is 
the context node for both $K_1$ and $FK_2$. As context node for $K_1$, it receives several 
lists, each containing a tuple with the wine name and year. The test verifies the 
uniqueness of those tuples. As context node for $FK_2$, it receives several lists, each 
containing a tuple with the name and year of a wine of a combination. The test 
verifies if each tuple is also a tuple for key $K_1$. For instance, (Sancerre, 2000), which 
represents a wine in a combination, appears as a wine in the menu of the restaurant. 

5. The boolean values computed in step 4 are carried up to the root. $K_1$ and $FK_2$ are 
satisfied if the conjunction of the boolean values results in true. □ 

Notice that, at each context node, key and foreign key constraints are verified by 
respecting a specific order: only after testing all key constraints, the transducer verifies 
foreign key constraints (wrt key values already tested). We recall that the context path 
of a foreign key must be the same as the context path of its corresponding key. 



2.2 Incremental Validation of Updates 

Let us now consider updates over valid XML trees. To this end, we suppose that: 
- Updates are seen as changes to be performed on the XML tree T. 

- Only updates that preserve the validity of the document (with respect to schema, 
key and foreign key constraints) are accepted. If the update violates a constraint, 
then it is rejected and the XML document remains unchanged. 

- The acceptance of an update relies on incremental validation tests, i.e., only the va- 
lidity of the part of the original document directly affected by the update is checked. 

We focus on two kinds of update operations. The insertion of a subtree T' at position 
p of T and the deletion of the subtree rooted at p in T. To verify if an update should be 
accepted, we perform incremental tests, summarized as follows: 

1 . Schema constraints: We consider the run of the tree transducer on the subtree of T 
composed just by the updated position p, its siblings and their father. If the state 
assigned to p’s father does not change due to the update, i.e., the tree transducer 
maintains the state assignment to p’s father as it was before the update, then schema 
constraints are not violated (see [5] for details). 

2. Key and foreign key constraints: To facilitate the validation of keys and foreign keys 
for an update operation, we keep an index tree of those tuples in T defined by each 
key. For each key tuple a reference counter is used in order to know how many times 
the tuple is used as a foreign key. 

The verification of key and foreign key constraints changes according to the update 
operation being performed. Firstly we have to find (for each key and foreign key) 
the corresponding context node p', concerned by the insertion or the deletion. Then, 
in order to insert a subtree T' at position p of T we should perform the following 
tests: (i) verify that T' does not contain duplicate key values for context p', (ii) 
verify that T' does not contain key values already appearing in T for context p', 
(iii) verify that T' does not contain foreign key values appearing neither in T' 
nor in T for context p' and (iv) for each key tuple in context p' being referenced 
by a foreign key in T', increase its reference counter. 

Similarly, to delete a subtree T' rooted at position p from an XML tree T we 
should perform the following tests, for each context p': (i) verify that T' contains 
only key values that are not referenced by foreign keys (not being deleted) and (ii) 
for each key tuple in context p' being referenced by a foreign key in T', decrease 
its reference counter. 
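
The insertion and deletion tests can be sketched with a dictionary of reference counters for a single context. The class below is a simplification under assumptions: the name `KeyIndex` and its methods are hypothetical, key and foreign key tuples of the subtree are supposed to be already extracted, and the index is a flat dictionary and not the keyTree structure.

```python
from collections import Counter

class KeyIndex:
    """Key tuples of one context, each with its foreign key reference count."""

    def __init__(self, refcount=None):
        self.refcount = dict(refcount or {})

    def insert(self, new_keys, new_fks):
        """Apply tests (i)-(iv) for a subtree with the given tuples."""
        if len(set(new_keys)) != len(new_keys):          # (i) duplicates
            return False
        if any(k in self.refcount for k in new_keys):    # (ii) already in T
            return False
        known = set(self.refcount) | set(new_keys)
        if any(fk not in known for fk in new_fks):       # (iii) dangling
            return False
        for k in new_keys:
            self.refcount[k] = 0
        for fk in new_fks:                               # (iv) count references
            self.refcount[fk] += 1
        return True

    def delete(self, old_keys, old_fks):
        """Apply the deletion tests for a subtree with the given tuples."""
        inside = Counter(old_fks)   # references that disappear with the subtree
        for k in old_keys:
            if self.refcount[k] - inside[k] > 0:         # still referenced
                return False
        for fk in old_fks:
            self.refcount[fk] -= 1
        for k in old_keys:
            del self.refcount[k]
        return True
```

An update is rejected by returning `False` before any counter is touched, so the index stays unchanged, in the same way as the document does.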

The acceptance of an update over an XML tree T wrt keys and foreign keys requires 
information about key values in T. Given an XML tree T, the tree transducer is used 
once to verify its validity (from scratch). During this first execution of the tree transducer 
an index tree, called keyTree, is built for each key constraint K that should be respected 
by T. Each $keyTree_K$ is a tree structure that stores the position of each context and target 
node together with the values associated to each key node in T. Fig. 2 describes this 
index structure using the notation of DTDs and Fig. 3 shows a keyTree for key $K_1$ of 
Example 1. The next example illustrates the validation of updates. 

Example 3. Given the XML tree of Fig. 1, we show its incremental verification due to 
the insertion of a new wine in the menu of a restaurant (i.e., the insertion of a labeled tree 
t' at position p = 0200 of t). Moreover, we consider a specification stating that a position 
p labeled drinks should respect the following schema constraint: the concatenation of 

<!DOCTYPE keyTree [ 
<!ELEMENT keyTree (context*)> 
<!ATTLIST keyTree nameKey CDATA #REQUIRED> 
<!ELEMENT context (target+)> 
<!ATTLIST context pos CDATA #REQUIRED> 
<!ELEMENT target (key+)> 
<!ATTLIST target pos CDATA #REQUIRED refCount CDATA #REQUIRED> 
<!ELEMENT key (#PCDATA)> 
]> 



Fig. 2. DTD specifying structure keyTree 



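[Figure: a keyTree with one context and two targets. The target at position 0200 (refCount 1)
holds the key values Sancerre and 2000; the target at position 0201 (refCount 3) holds the
key values Cahors and 2002.]

Fig. 3. KeyTree $K_1$ built over the document of Fig. 1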



the labels associated with p's children forms a word that matches the regular
expression $q_{wine}^*$.

The verification of the update with respect to schema constraints consists in: (i) considering
that the update is performed (without performing it yet) and (ii) verifying whether the
state $q_{drinks}$ can still be associated with position 020 (the father of 0200) by analyzing the
schema constraint imposed over nodes labeled drinks. To this end, we build the sequence
of states associated with the children of 020. The insertion shifts the right siblings of p
to the right. Thus, we consider the state $q_{wine}$ associated with positions 0201 and 0202
and we only calculate the state associated with the update position 0200. As the root of t'
(at position 0200) is associated with the state $q_{wine}$, we obtain the word $q_{wine}\, q_{wine}\, q_{wine}$.
This word matches the regular expression $q_{wine}^*$. Thus, the update respects the schema
constraints [5].
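
The schema test just described can be sketched with a regular expression over state names. This is a minimal illustration, assuming that states are encoded as strings separated by single spaces; the names `DRINKS_CONTENT`, `states_after_insert` and `satisfies` are hypothetical and do not come from the paper.

```python
import re

# Hypothetical encoding of the expression qwine* over space-separated states.
DRINKS_CONTENT = r"(qwine( qwine)*)?"

def states_after_insert(sibling_states, index, new_state):
    """Simulate the insertion: right siblings are shifted to the right."""
    return sibling_states[:index] + [new_state] + sibling_states[index:]

def satisfies(content_model, child_states):
    """Check that the word formed by the children's states matches the model."""
    return re.fullmatch(content_model, " ".join(child_states)) is not None

before = ["qwine", "qwine"]                      # children of position 020
after = states_after_insert(before, 0, "qwine")  # insertion at position 0200
print(satisfies(DRINKS_CONTENT, after))
```

Only the state of the inserted root has to be computed; the states of the siblings are reused from the previous run.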

Now we verify whether $K_1$ and $FK_2$ (Example 1) are preserved by the insertion. As the
inserted subtree contains only one key value, it contains no key violation by itself (no
duplicate key values). Then we assume that the update is performed (without performing
it yet) and we verify that the key value being inserted does not conflict with
those already existing in the original document. In our case, we suppose that the wine
being inserted is identified by the key tuple (Bordeaux, 1990). Comparing this value to
those stored in $keyTree_{K_1}$ (Fig. 3), we notice that no violation exists. The inserted
subtree does not contain foreign key values and, thus, we can conclude that the update
is possible with respect to key and foreign key constraints.

As the above tests succeed, the insertion can be performed. Performing an update
implies changes not only to the XML tree but also to the keyTree index trees. □
3 Tree Transducers for XML

We first present the definition of our tree transducer. This transducer combines a tree
automaton (expressing schema constraints) with a set of output functions (defining key
and foreign key constraints). In this paper we disregard the specificity of attributes.

Definition 3. Output function: Let D be an infinite (recursively enumerable) domain
and let $D^*$ denote the set of all lists of items in D. Let T = (t, type, value) be an XML
tree. An output function f takes as arguments: (i) a tree position $p \in dom(t)$ and (ii) a
list l of items in D. The result of applying f(p, l) is a list of items in D. In other words,
$f : dom(t) \times D^* \to D^*$. □

We recall the process described in Example 2: at each node, data values are collected
from children nodes and can be used to perform tests. Output functions are defined to
perform these actions: for the node at position p, each output function takes as parameters
the list l of values coming from p's children. One output function is defined for each key
and foreign key.

Definition 4. Unranked bottom-up tree transducer (UTT): A UTT over $\Sigma$ and D is
a tuple $\mathcal{U} = (Q, \Sigma, D, Q_f, \Delta, \Gamma)$ where Q is a set of states, $Q_f \subseteq Q$ is a set of final
states, $\Delta$ is a set of transition rules and $\Gamma = \{f_1, \ldots, f_n\}$ is a set of output functions.
Each transition rule in $\Delta$ has the form $a, E \to q$ where (i) $a \in \Sigma$, (ii) E is a regular
expression over Q and (iii) $q \in Q$. Each output function in $\Gamma$ has the form $f_j(p, l) = l'$
as in Definition 3. □

Key and foreign key constraints are expressed by output functions in F. As the tree 
is to be processed bottom-up, the basic task of output functions is to define the values 
that have to be passed to the parent position, during the run. 
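
A bottom-up run of such a transducer can be sketched as below. This is a simplified illustration under several assumptions: a single output function is used instead of a set, regular expressions are written over space-separated state names, and the class `Node`, the function `run`, the rule table `RULES` and the label `year` are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    value: str = None  # only meaningful for data nodes

def run(node, rules, output_fn):
    """Return (state, output list) for node; state is None if no rule applies."""
    results = [run(child, rules, output_fn) for child in node.children]
    # Word formed by the states of the children ("?" marks a failed child).
    word = " ".join(s if s is not None else "?" for s, _ in results)
    carried = [item for _, out in results for item in out]
    state = None
    for regex, target in rules.get(node.label, []):
        if re.fullmatch(regex, word):
            state = target
            break
    return state, output_fn(node, carried)

# Transition rules "a, E -> q", grouped by label a (an assumed toy schema).
RULES = {
    "data": [("", "qdata")],
    "name": [("qdata", "qname")],
    "year": [("qdata", "qyear")],
    "wine": [("qname qyear", "qwine")],
    "drinks": [("(qwine( qwine)*)?", "qdrinks")],
}

def collect_values(node, carried):
    """Toy output function: data nodes emit their value, others pass values up."""
    return [node.value] if node.label == "data" else carried

wine = Node("wine", [Node("name", [Node("data", value="Sancerre")]),
                     Node("year", [Node("data", value="2000")])])
state, output = run(Node("drinks", [wine]), RULES, collect_values)
print(state, output)
```

The tree is accepted with respect to the schema when the state computed for its root belongs to the set of final states.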



3.1 Generating Constraint Validators 

Given a specification $\mathcal{D} = (D, K)$ where D is a set of schema constraints and K is
composed of a set of keys and foreign keys, we propose a method to translate $\mathcal{D}$ into
a UTT. In this sense, we present an algorithm to generate a validator from a given
specification. This validator is executed to check the constraints in $\mathcal{D}$ for any XML tree.

Let $\mathcal{U} = (Q, \Sigma, D, Q_f, \Delta, \Gamma)$ be a UTT whose transition rules in $\Delta$ are obtained
from the translation of a DTD D (part of $\mathcal{D}$). Each output function in $\Gamma$ is related to a
finite state automaton that indicates which nodes play a role in keys and foreign keys.
Notice that context, target and key nodes in each key $K_j$ or foreign key $FK_j$ are defined
in a top-down fashion. In order to identify these nodes using a bottom-up tree automaton,
we must traverse the paths stated by each key $K_j$ or foreign key $FK_j$ in reverse. If we
see paths as regular expressions, then we can associate finite state automata with them.
Paths in reverse are recognized by reversing all the transitions of these automata [13].
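
Reversing the transitions of a finite automaton can be sketched as follows. The representation (a set of triples) and the names `reverse_automaton` and `accepts` are assumptions made for this illustration; the path restaurant/drinks/wine is only a toy example.

```python
def reverse_automaton(transitions, initial, finals):
    """Reverse every transition; old final states become initial and vice versa.

    transitions -- set of (source, symbol, destination) triples
    Returns (reversed transitions, initial states, final states).
    """
    reversed_transitions = {(dst, sym, src) for (src, sym, dst) in transitions}
    return reversed_transitions, set(finals), {initial}

def accepts(transitions, initials, finals, word):
    """Simulate a (possibly nondeterministic) automaton on a list of symbols."""
    current = set(initials)
    for symbol in word:
        current = {dst for (src, sym, dst) in transitions
                   if src in current and sym == symbol}
    return bool(current & set(finals))

# Automaton for the top-down path restaurant/drinks/wine.
forward = {(0, "restaurant", 1), (1, "drinks", 2), (2, "wine", 3)}
rev, rev_initials, rev_finals = reverse_automaton(forward, 0, {3})
print(accepts(rev, rev_initials, rev_finals, ["wine", "drinks", "restaurant"]))
```

The reversed automaton reads labels in the order in which a bottom-up traversal meets them, from the deepest node towards the root.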

Given a key constraint $K_j$ ($1 \le j \le k$) or a foreign key constraint $FK_j$
($k+1 \le j \le n$), we use the following notations: for context path $P_j$, we have
$M_j = (\Theta_j, \Sigma, \delta_j, e_j, F_j)$; for target path $P'_j$, $M'_j = (\Theta'_j, \Sigma, \delta'_j, e'_j, F'_j)$; for key or foreign key
paths $P^1_j \mid \ldots \mid P^{m_j}_j$, $M''_j = (\Theta''_j, \Sigma, \delta''_j, e''_j, F''_j)$. For the sake of homogeneity, we
define $M_p = (\{e_0, e_f\}, \{root\}, \{\delta(e_0, root, e_f)\}, e_0, \{e_f\})$ as the finite state automaton
recognizing the path formed just by the symbol root. Fig. 4 illustrates the finite state
automata that recognize the paths in $K_1$ and $FK_2$ of Example 1 in reverse.

Remark: We denote by M.e the current state e of the finite state automaton M, and we
call it a configuration.




Fig. 4. Automata corresponding to the paths of $K_1$ and $FK_2$ in reverse



Algorithm 1 - Key constraints as output functions:

Input: A set of k keys $\{K_j = (P_j, (P'_j, \{P^1_j, \ldots, P^{m_j}_j\})) \mid 1 \le j \le k\}$, a set of (n - k)
foreign keys $\{FK_j = (P_j, (P'_j, \{P^1_j, \ldots, P^{m_j}_j\})) \subseteq K \mid (k+1) \le j \le n;\ K \text{ is a key}\}$
Output: A set of output functions $\Gamma = \{f_1, \ldots, f_n\}$

begin
   $\Gamma = \emptyset$
   for each $K_j$ and $FK_j$ do
      Build the finite automata $M_j$, $M'_j$ and $M''_j$
      Use $M_j$, $M'_j$ and $M''_j$ to specify the function $f_j$
      $\Gamma = \Gamma \cup \{f_j\}$
   return $\Gamma$
end

Each output function $f_j \in \Gamma$ is specified by the algorithm below:

Algorithm 1.1 - Specification of output functions

Input: Automata M, M' and M'' concerning a key K or a foreign key FK.
A position p and a list l of pairs in D. For each pair, the first element is an automaton
configuration and the second element is a list of values.
Output: A list of pairs in D.

begin

Let a := t(p) // a is the label of position p

(1) If a = data then return $[(M''.e'', [value(t, p)])]$

(2) If a is a target label for K or FK
then return $[(M'.\delta'(e', a), checkArity(concat(filter_{key}(l))))]$

Function $filter_{key}$ leaves in the key lists only the values associated with key positions
of K (or FK). For that purpose, it selects the singletons whose configuration corresponds
to a final state of M''. Function concat returns the concatenation of all its
argument lists into one list. If the length of the resulting list does not correspond
to the length m of K, then function checkArity replaces it by an empty list. For
foreign keys the length is not tested.

(3) If a is a context label for a key K
then return $[(M.\delta(e, a), checkKey(filter_{target}(l)))]$
where $checkKey([v_1 \ldots v_m])$ = [true] if $v_1, \ldots, v_m$ are all nonempty distinct lists,
and [false] otherwise.

(4) If a is a context label for a foreign key FK
then return $[(M.\delta(e, a), checkForeign(filter_{target}(l)))]$
where $checkForeign([v_1 \ldots v_m])$ = [true] if $v_1, \ldots, v_m$ are lists whose values appear
in the key taking part in the definition of FK, and [false] otherwise.

Remark: In cases (3) and (4) above, function $filter_{target}$ rejects all the values not
belonging to target lists of key K (or foreign key FK), i.e., those whose configuration
does not correspond to the final state of M'.

(5) If a is the root label then return $[(M_p.e_f, concat(filter_{context}(l)))]$

Function $filter_{context}$ rejects all the values not belonging to context lists (configuration
different from the final state of M).

(6) In all other cases
(i.e., when $a \neq$ data and a is not a target label, nor a context label, nor the root)
return carryUp(l)

where function carryUp is defined as follows:
function carryUp (L : list of pairs)
var result : list of pairs
begin
   result $\leftarrow$ [ ]
   for each c = (M.e, v) in L // M stands for M, M' or M''
      if $\delta(e, a) = e'$ is a transition in M then result $\leftarrow$ concat(result, [(M.e', v)])
   return result
end

end □

In cases (1) to (5) the resulting list contains only one pair. A pair is always composed 
by: 

(A) A configuration M.e where M is one of the finite automata representing paths in
keys, and e is a state of M. For example, in case (2), M is M', the target automaton
for K or FK. This configuration is obtained by performing the first transition of
automaton M', using the symbol a as input. Notice that $\delta'(e', a)$ is a state of M'.
Other cases are similar.

(B) A list of values. From data nodes to target nodes the list contains only one value. 
From target nodes to context nodes the list contains the values composing a key 
(or foreign key). From context nodes to the root the list contains one boolean value 
indicating that within a given context, K or FK holds or not. 
Notice that at the foreign key context level (case (4)), FK and its associated key have
the same context and the tuples representing the key are computed before those that
represent the foreign key FK (since keys are indexed by $K_j$ with $1 \le j \le k$ and foreign
keys by $FK_j$ with $k+1 \le j \le n$). Once computed, key tuples are stored in keyTrees;
then foreign key tuples can be checked (as shown in the next section). At root level
(case (5)), we have the boolean values that were obtained for each subtree rooted at the
context level. In case (6), values are carried up by function carryUp. This function selects
pairs from children nodes belonging to key and foreign key paths, by checking
configurations in these pairs. The resulting list can contain more than one pair. If nodes
are not concerned by any key or foreign key, the function carryUp does not transmit any value.
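
Function carryUp (case (6)) and function checkKey (case (3)) can be sketched in Python as follows. This is an illustration only: the dictionary-based encoding of automata, the names `carry_up`, `check_key` and `AUTOMATA`, and the label `year` are assumptions that do not come from the paper.

```python
def carry_up(pairs, label, automata):
    """Keep only the pairs whose automaton has a transition on `label`.

    pairs    -- list of ((automaton name, state), values)
    automata -- dict: automaton name -> {(state, label): next state}
    """
    result = []
    for (name, state), values in pairs:
        next_state = automata[name].get((state, label))
        if next_state is not None:
            result.append(((name, next_state), values))
    return result

def check_key(value_lists):
    """Return [True] if all lists are nonempty and pairwise distinct."""
    tuples = [tuple(v) for v in value_lists]
    nonempty = all(len(t) > 0 for t in tuples)
    return [nonempty and len(set(tuples)) == len(tuples)]

# Assumed fragment of a key automaton M'': e0 -name-> e1 and e0 -year-> e2.
AUTOMATA = {"M''": {("e0", "name"): "e1", ("e0", "year"): "e2"}}
print(carry_up([(("M''", "e0"), ["Sancerre"])], "name", AUTOMATA))
print(check_key([["Sancerre", "2000"], ["Cahors", "2002"]]))
```

A pair disappears as soon as the label read is not on the corresponding reversed path, which is how values of nodes unrelated to a key are discarded.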

3.2 Validating XML Documents 

The verification of keys and foreign keys is performed simultaneously, in one pass,
together with schema validation, during the execution of the UTT over an XML tree.
Example 4 illustrates such an execution while Definitions 5 and 6 formalize it. The index
keyTree, necessary to perform incremental updates on XML documents, is dynamically
built. Similarly to the one proposed in [9], it is a tree structure containing levels for the
key name, context, target, key and data nodes as defined in Fig. 2.

Example 4. We consider a specification $\mathcal{D}$ containing $K_1$ and $FK_2$ (Example 1). The
finite state automata associated with $K_1$ and $FK_2$ are the ones given in Fig. 4. To verify
whether the XML tree T of Fig. 1 satisfies $K_1$ and $FK_2$ we run the transducer $\mathcal{U}$ (from $\mathcal{D}$)
over T (recall that $\mathcal{U}$ contains two output functions $f_1$ and $f_2$ defined from $K_1$ and $FK_2$
(respectively), following Algorithm 1):

1. For the data nodes, each output function returns a singleton list that contains a pair:
the initial configuration of the key (or foreign key) automaton M'', and the value of
the node. Positions 020000 and 03000 are data nodes, so we have:

$f_1(020000, [\,]) = [(M''_1.e_0, [Sancerre])]$; $f_2(03000, [\,]) = [(M''_2.e_0, [Sancerre])]$.

2. The fathers of data nodes which are key (or foreign key) nodes should carry up the
values received from their children. Thus, each of them executes a first transition in
M'' using each key (or foreign key) label as input. For each father of a data node
which is not a key (or a foreign key) node, the output function returns an empty list.
For instance, position 02000 is a key node for $K_1$ and position 0300 is a foreign key
node for $FK_2$. Then, reading the label name from state $e_0$ of $M''_1$, we reach state $e_1$,
and we carry up the value Sancerre. We obtain a similar result for $FK_2$ when reading
label wineName:

$f_1(02000, [(M''_1.e_0, [Sancerre])]) = [(M''_1.e_1, [Sancerre])]$;
$f_2(0300, [(M''_2.e_0, [Sancerre])]) = [(M''_2.e_1, [Sancerre])]$.

At this stage the construction of $keyTree_{K_1}$ starts by taking into account the
information associated with each key node (e.g., $keyTree_{K_1}[t, 02000]$ is the subtree rooted
at key and associated with the value Sancerre in Fig. 3).

3. For node 0200, wine is a target label of $K_1$ and for node 030, combination is a
target label of $FK_2$. In order to transmit only key (or foreign key) values, the output
function of a target label (i) selects those that are preceded by a final state of the key
automaton M'', (ii) joins them in a new list, and (iii) executes the first transition
of the target automaton M'. In this way, at a target position the tuple value of a key
(or foreign key) is built:

$f_1(0200, [(M''_1.e_1, [Sancerre]), (M''_1.e_2, [2000])]) = [(M'_1.e_4, [Sancerre, 2000])]$;
$f_2(030, [(M''_2.e_1, [Sancerre]), (M''_2.e_2, [2000])]) = [(M'_2.e_4, [Sancerre, 2000])]$.

The construction of $keyTree_{K_1}$ continues and $keyTree_{K_1}[t, 0200]$ is obtained taking
into account the information available at position 0200. (See the subtree rooted at target
in Fig. 3.)

4. The computation continues up to the context, verifying whether the labels visited
are recognized by the target automaton or not and carrying up the key (or foreign
key) values. For instance, we reach state $e_5$ in $M'_1$ by reading the label "drinks"
(Fig. 4):

$f_1(020, [(M'_1.e_4, [Sancerre, 2000])]) = [(M'_1.e_5, [Sancerre, 2000])]$;

5. For the node 0, the label restaurant is a context label of both $K_1$ and $FK_2$. For $K_1$
(respectively $FK_2$) the output function selects the sublists associated with a final state
of the target automaton $M'_1$ (respectively $M'_2$). The output function of $K_1$ checks
whether all the selected sublists are distinct. The output function of $FK_2$ verifies whether the
selected sublists correspond to lists of values obtained for $K_1$. In both cases, the
output functions return a boolean value that will be carried up to the root:

$f_1(0, [(M'_1.e_6, [Sancerre, 2000]), (M'_1.e_6, [Cahors, 2002])]) = [(M_1.e_8, [true])]$;
$f_2(0, [(M'_2.e_5, [Sancerre, 2000])]) = [(M_2.e_7, [true])]$.

At this point, we have $keyTree_{K_1}[t, 0]$ represented by the subtree rooted at context
in Fig. 3. Notice that the attribute refCount for tuple (Sancerre, 2000) has value
1 because at this context node, the tuple (Sancerre, 2000) exists for foreign key
$FK_2$. Indeed, at the context level we increment the refCount of each key tuple that
corresponds to a foreign key tuple obtained at this level. Supposing that the tuple
(Cahors, 2002) appears in three different combinations (not presented in Fig. 1),
we would have refCount = 3 for it.

6. At the root position the last output function selects the sublists that are preceded
by a final state of the context automaton M and returns all boolean values in these
sublists. The construction of $keyTree_{K_1}$ finishes with a label indicating the name of
the key (Fig. 3). □



Definition 5. A run of $\mathcal{U}$ on a finite tree t: Let t be a $\Sigma$-valued tree and
$\mathcal{U} = (Q, \Sigma, D, Q_f, \Delta, \Gamma)$ be a UTT. Given the keys $K_1, \ldots, K_k$ and foreign keys
$FK_{k+1}, \ldots, FK_n$, a run of $\mathcal{U}$ on t is: (i) a tree $r : dom(r) \to Q$ such that dom(r) = dom(t);
(ii) a function $\Phi : dom(r) \to (D^*)^n$ and (iii) k keyTrees.

For each position p whose children are those at positions^1 $p0, \ldots, p(z-1)$ (with
$z \ge 0$), we have:

(i) r(p) = q if the following conditions hold:

1. $t(p) = a \in \Sigma$.
2. There exists a transition $a, E \to q$ in $\Delta$.
3. $r(p0) = q_0, \ldots, r(p(z-1)) = q_{z-1}$.
4. The word $q_0 \ldots q_{z-1}$ belongs to the language generated by E.

(ii) $\Phi(p) = l = (f_1(p, concat(l^1_0, \ldots, l^1_{z-1})), \ldots, f_n(p, concat(l^n_0, \ldots, l^n_{z-1})))$ with
$\Phi(p0) = l_0, \ldots, \Phi(p(z-1)) = l_{z-1}$ where each $l_i = (l^1_i, \ldots, l^n_i)$ is an n-tuple.

(iii) for $1 \le j \le k$, $keyTree_{K_j}[t, p]$ is constructed using the already computed
$keyTree_{K_j}[t, p0], \ldots, keyTree_{K_j}[t, p(z-1)]$, as follows:

(a) If t(p) is a key label of $K_j$, then $keyTree_{K_j}[t, p]$ is the tree:
<key> t(p) = value(t, p0) </key>

(b) If t(p) is a target label of $K_j$, then $keyTree_{K_j}[t, p]$ is:
<target pos=p refCount=0> $keyTree_{K_j}[t, p0] \ldots keyTree_{K_j}[t, p(z-1)]$ </target>

(c) If t(p) is a context label of $K_j$, then $keyTree_{K_j}[t, p]$ is:
<context pos=p> $keyTree_{K_j}[t, p0] \ldots keyTree_{K_j}[t, p(z-1)]$ </context>
Moreover, if t(p) is a context label of a foreign key FK, then increment the
attribute refCount in the corresponding $keyTree_{K_j}$.

(d) If t(p) is the root label then $keyTree_{K_j}[t, p]$ is the tree:
<keyTree nameKey=$K_j$> $keyTree_{K_j}[t, p0] \ldots keyTree_{K_j}[t, p(z-1)]$ </keyTree>

(e) In all other cases, for each key $K_j$, we define $keyTree_{K_j}[t, p]$ as the forest
composed of all the trees $keyTree_{K_j}[t, p0] \ldots keyTree_{K_j}[t, p(z-1)]$.

Notice that, although the keyTrees are defined in general as forests, for the special
labels mentioned in cases (a) to (d) above, we build a single tree. □

^1 The notation $p(z-1)$ indicates the position resulting from the concatenation of the position p
and the integer $z-1$. If z = 0 the position p has no children.

Definition 6. Validity: An XML tree t is said to be valid with respect to schema
constraints if there is a successful run r, i.e., $r(\epsilon) \in Q_f$. An XML tree t is said to be valid
with respect to key and foreign key constraints if the lists of $\Phi(\epsilon)$ contain only the value
true for each key and foreign key. □

Note that item (ii) of Definition 5 specifies that the output for each position p in
the XML tree is a tuple composed of one list for each key (or foreign key) being verified.
Each list $l^j$ in the tuple is the result of applying the output function $f_j$, defined for the
jth key or foreign key, to the following arguments:

- p: the position in dom(t).

- $concat(l^j_0, \ldots, l^j_{z-1})$: the list formed by the information carried up from the children
of p, concerning the jth key.

At the end of the run over an XML tree, each key $K_j$ is associated with a
$keyTree_{K_j}$ that respects the general schema given by Fig. 2. Attribute pos stores the target and
context positions for a given key and attribute refCount indicates when a key $K_j$ is
referenced by a foreign key.
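
As an illustration, a keyTree following the schema of Fig. 2 can be assembled with Python's standard xml.etree.ElementTree module. The function `build_key_tree` and the shape of its input are assumptions made for this sketch; the values are those discussed in Example 4.

```python
import xml.etree.ElementTree as ET

def build_key_tree(name_key, contexts):
    """Build a keyTree element.

    contexts -- list of (context position, targets), where targets is a list
                of (target position, reference counter, list of key values)
    """
    root = ET.Element("keyTree", nameKey=name_key)
    for context_pos, targets in contexts:
        context = ET.SubElement(root, "context", pos=context_pos)
        for target_pos, ref_count, values in targets:
            target = ET.SubElement(context, "target",
                                   pos=target_pos, refCount=str(ref_count))
            for value in values:
                ET.SubElement(target, "key").text = value
    return root

tree = build_key_tree("K1", [
    ("0", [("0200", 1, ["Sancerre", "2000"]),
           ("0201", 3, ["Cahors", "2002"])]),
])
print(ET.tostring(tree, encoding="unicode"))
```

Looking up the tuples of one context then amounts to reading the key children of each target under the context element with the wanted pos attribute.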

4 Incremental Validation of Updates 

We consider two update operations, denoted by insert(T, p, T') and delete(p, T), where
T and T' are XML trees and p is a position. Fig. 5 illustrates these operations on a
$\Sigma$-valued tree. Only updates that preserve validity with respect to the constraints are accepted.

Fig. 5. (i) Initial $\Sigma$-valued tree t having labels a (position $\epsilon$), b (position 0) and c (position 1). (ii)
Insertion at p = 2. (iii) Insertion at p = 1. (iv) Deletion at p = 2.

4.1 Incremental Key and Foreign Key Validation 

Let T = (t, type, value) be a valid XML tree, i.e., one satisfying a collection of keys $K_j$
($1 \le j \le k$) and foreign keys $FK_j$ ($(k+1) \le j \le n$). Let $\mathcal{U} = (Q, \Sigma, D, Q_f, \Delta, \Gamma)$ be
a UTT specifying all the constraints that should be respected by T. We should consider
the execution of $\mathcal{U}$ over a subtree T' being inserted or deleted.

Given a subtree T' = (t', type, value), the execution of $\mathcal{U}$ over T' gives a tuple:

$(q', (l_1, \ldots, l_n), (keyTree_{K_1}[t', \epsilon], \ldots, keyTree_{K_k}[t', \epsilon]))$    (1)

where q' is the state associated with the root of t', $(l_1, \ldots, l_n)$ is an n-tuple of lists and
$(keyTree_{K_1}[t', \epsilon], \ldots, keyTree_{K_k}[t', \epsilon])$ is a k-tuple containing the keyTree for each key.

Notice that the n-tuple of lists has two distinct parts. Lists $l_1, \ldots, l_k$ represent keys and
lists $l_{k+1}, \ldots, l_n$ represent foreign keys. Each $l_j$ ($1 \le j \le n$) is a list of pairs, i.e., each $l_j$
has the form $[c_1, \ldots, c_m]$ where each $c_h$ is a pair containing an automaton configuration
and a list of values.

When performing an insertion, we want to ensure that T' has no "internal" validity
problems (as, for instance, duplicated values for $K_j$). Thus, we define T' as locally valid
if the tuple (1) respects the following conditions: (A) q' is a state in Q; (B) for each list
$l_j$ ($1 \le j \le k$) we have:

(i) if the root of t' is a target position for $K_j$ then the number of values in $l_j$ equals the
number of elements composing the key $K_j$;

(ii) if the root of t' is a context position for $K_j$ then the list $l_j$ is $[(M_j.e, [true])]$;

(iii) if the root of t' is a position above the context positions for $K_j$ then the list $l_j$ is
$[c_1, \ldots, c_m]$, where each pair $c_h$ does not contain [false] as its list of values.

Notice that no condition is imposed on foreign keys. A subtree T' can contain tuple
values referring to a key value appearing in T (and not in T').

In the following, we assume that subtrees being inserted in a valid XML tree are
locally valid and we address the problem of evaluating whether an update should be
accepted with respect to key and foreign key constraints. Before accepting an update,
we incrementally verify that it does not cause any constraint violation. To perform
these tests, we need the context node of a key or foreign key. To this end, we define
procedure findContext that computes:

- The context position p' for a key $K_j$ (or a foreign key $FK_j$) which is an ancestor
of the update position p in the tree t.
- A list l' containing the key (or foreign key) values for that context position, considering
values carried up from the subtree being inserted or deleted.^2

The tests performed for the insertion operation insert(T, p, T') are presented next. Recall
that T is valid and T' is locally valid.

Algorithm 2 - Incremental tests for update operation insert(T, p, T')

1. For each list $l_j \neq [\,]$ ($1 \le j \le k$) obtained in the execution of $\mathcal{U}$ over T', for each
key $K_j$, do

a) If p is under a context node of $K_j$ then

i. Call findContext(p, $l_j$), which returns a context position p' and $l' = [v_1, \ldots, v_r]$.

ii. For each list v in l' do

If there exists a tuple kval in $keyTree_{K_j}[t, p']$ such that kval = v
then the insertion violates $K_j$ and must be rejected
else the insertion respects $K_j$.

b) If p is the context position or it is between the root and a context node of $K_j$
then the insertion respects $K_j$.

2. For each $l_j \neq [\,]$ ($(k+1) \le j \le n$) obtained in the execution of $\mathcal{U}$ over T' do

a) Call findContext(p, $l_j$), which returns a context position p' and $l' = [v_1, \ldots, v_r]$.

b) For each list v in l' do:

If there exists a tuple kval in the $keyTree_K$ such that kval = v
then the insertion respects the foreign key $FK_j$. The reference counter that
corresponds to kval will be incremented at the end of the procedure, if the
insertion is accepted.
else the insertion does not respect the foreign key $FK_j$ and must be rejected.

3. If all keys and foreign keys, together with schema constraints [5], are respected
then accept the update and perform the modifications to T and all keyTrees.
else reject the update. □

Before performing an insertion. Algorithm 2 tests if we are not adding key duplicates 
on T and if the new foreign key values correspond to key values. When we refer to a 
tuple in a keyTree, this tuple is obtained by concatenating the key values found inside 
target tags of this keyTree, taking into account a context position p' . The next example 
illustrates an insertion operation with respect to key and foreign key constraints. 

Example 5. We consider the update insert(T, 0200, T') presented in Example 3.
The execution of $\mathcal{U}$ over T' gives the tuple:

$(q_{wine}, ([(M'_1.e_4, [Bordeaux, 1990])], [\,]), (keyTree_{K_1}[t', \epsilon]))$.

We see that T' is locally valid and that the update affects only $K_1$. Procedure findContext
returns the context position p' = 0 and the list l' = [(Bordeaux, 1990)]. We compare
the tuples in l' with those in $keyTree_{K_1}$ (Fig. 3) for context p' = 0. All these tuples are
distinct and thus the insertion is possible for $K_1$. As no other key is affected, the insertion
is accepted. □
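
The key and foreign key tests of Algorithm 2 for a single context can be sketched as follows. This is a simplified illustration, not the implementation of the paper: the index is a plain dictionary from key tuples to reference counters, the name `check_insert` is hypothetical, and a foreign key tuple is also allowed to refer to a key tuple of the inserted subtree itself, following the informal tests given before Example 3.

```python
def check_insert(key_index, new_key_tuples, new_fk_tuples):
    """Incremental tests for an insertion under one context p'.

    key_index      -- dict: key tuple -> reference counter (content of the keyTree)
    new_key_tuples -- key tuples carried up from the inserted subtree T'
    new_fk_tuples  -- foreign key tuples carried up from T'
    Returns True and updates key_index if the insertion is accepted.
    """
    # Step 1: the inserted key tuples must not duplicate existing ones.
    for v in new_key_tuples:
        if v in key_index:
            return False
    # Step 2: every foreign key tuple must match a key tuple (in T or in T').
    known = set(key_index) | set(new_key_tuples)
    for v in new_fk_tuples:
        if v not in known:
            return False
    # Step 3: accept, then update the index and the reference counters.
    for v in new_key_tuples:
        key_index[v] = 0
    for v in new_fk_tuples:
        key_index[v] += 1
    return True

index = {("Sancerre", "2000"): 1, ("Cahors", "2002"): 3}
print(check_insert(index, [("Bordeaux", "1990")], []))
```

The cost of the key test depends on the number of tuples carried up from T', not on the size of the whole document, since each lookup is a dictionary access.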

In a similar way, we define incremental tests for the operation delete(p, T). These
tests check that the deletion of a subtree rooted at a position p does not violate constraints,
before actually removing the subtree. The details are given in [1].

^2 Let $l_j$ be the list of pairs obtained for $K_j$ or $FK_j$ by the local validity check. Procedure
findContext executes the automaton M (composition of M'' and $M_j$) starting from the configurations
in $l_j$ and using the labels associated with the ancestors of position p [1].







M.A. Abrao et al. 



5 Conclusions 

This paper extends and merges our previous proposals [5,6]. In [5], we propose an 
incremental validation method, but only with respect to schema constraints. In [6] we 
just consider the validation from scratch of an XML document associated to only one key 
constraint. In the current paper, we deal with incremental validation of updates taking 
into account schema constraints together with several key and foreign key constraints. 
Our verification algorithm uses only synthesized values (i.e., values communicated from 
the children to the parents of a tree), making the algorithms suitable for implementation 
in any parser generator, or even using SAX [14] or DOM [19]. The algorithms presented 
here have been implemented using the ASF+SDF meta-environment [7] . The verification 
of keys and foreign keys uses KeyTrees, which can also be used for efficiently evaluating 
queries based on key values. 

Validity verification methods for schema constraints have been addressed by [5, 10, 15,
16, 17, 18]. The validation of updates is also treated in [10, 17]. In [17], schema constraints
with respect to specialized DTDs are considered and incremental validation is performed in time
$O(\log^2 n)$, where n is the size of the document. As shown in [5], in terms of schema
constraints, our incremental validation (with respect to DTDs) is $O(m + 1)$, where m is the number
of children of the update position.

Key constraints for XML have been recently considered in the literature (for instance,
in [2, 4, 6, 8, 9]) and some of their aspects are adopted in XML Schema. In our paper, the
definition of integrity constraints follows the key specification introduced in [8]. As
shown in [11], it is easy to produce examples of integrity constraints that no XML
document (valid with respect to a schema) can satisfy. In our work, we assume that key and
foreign key constraints are consistent with respect to a given DTD.

In [9] a key validator which works in asymptotic linear time in the size of the
document is proposed. Our algorithm also has this property. In contrast to our work,
in [3, 9] schema constraints are not considered and foreign keys are not treated in detail.
In [4] both schema and integrity constraints are considered in the process of generating
XML documents from relational databases. Although some aspects similar to our
approach can be observed, we place our work in a different context. In fact, we consider
the evolution of XML data independently from any other database sources (in this context
both validation and re-validation of XML documents can be required).

We are currently studying the following lines of research: (i) an extension of our
method to deal with other schema specifications, for instance XML-Schema and specialized
DTDs; (ii) an implementation of an XML update language such as UpdateX [12] in
which incremental constraint checking will be integrated. To this end, we shall consider
a transaction including several updates and check the validity of its result.



Acknowledgements. We would like to thank the anonymous referees for their sugges- 

Abstract. In the last few years, XML has been widely used as a logical
data model, and several database applications are modeled in XML. To
model a database application in XML, we should first come up with a
conceptual design for representing the application requirements, and then
translate this conceptual design to XML. Existing conceptual models
like the ER (Entity Relationship) model, UML and ORM do not have
the modeling capabilities to represent the main features provided by XML, such
as union types. In this work, we extend the ER model with additional
features; we call our conceptual model EReX (ER extended for XML).
Translating an EReX design to XML enables us to make use of the
different features provided by XML. Our approach further enables us to
study a fundamental problem facing the XML database community today:
what structural and constraint specification should be provided in XML
so that any generic database application can be modeled in XML.



1 Introduction 

Over the last few years, XML (extensible Markup Language) [5], published by
W3C, has established itself as a promising logical data model that is used widely
for database applications. There are at least two good reasons for this widespread
use of XML: (a) necessity - XML is the lingua franca for information exchange
over the web; therefore, if we need to exchange our data with web applications, we
need to model our data in XML; (b) capability - XML provides several favorable
features often necessary for modeling present day applications, such as union
types and ordered relationships. Our work focuses on the latter aspect, and we
examine how we can use these XML features effectively for database applications.

The database design process is typically carried out in different stages [3]. First, during
the conceptual design phase, the database designer represents the application
requirements as a schema in a conceptual model. Examples of conceptual models
include ER [7], UML [16], and ORM [12]. Then in the logical design phase, the
conceptual schema is represented as a schema in a logical model such as the relational
model [8], the object-relational model or XML. There is a third phase where the
logical schema is translated to a physical schema. In this work, we describe our
conceptual model called EReX, and an algorithm to translate an EReX schema
to "good" XML schemas. We do not consider the physical schema or the database
implementation, which could be in native XML databases, relational databases,
object-relational databases etc.

We use our work to study a fundamental problem facing the XML database
community today. There are different schema languages for XML; the three most
popular ones are DTD [5], XML-Schema [19], and RELAX-NG [15]. Each of these
schema languages has different structural and constraint specification
characteristics. In [14], the authors study the structural specification characteristics of
the schema languages and show the following: (a) the most expressive schema
languages such as RELAX-NG are closed under different set operations such as
union, intersection and difference, whereas the less expressive schema languages
such as DTD and XML-Schema are closed only under intersection, (b) DTD and
XML-Schema guarantee unambiguous "type assignment", whereas RELAX-NG
schemas may yield ambiguous type assignment. Database applications require
both unambiguous type assignment and closure properties. XQuery
approaches this problem as follows [18]: we require the XML design to specify unambiguous
types, but we do not place this restriction during XML query processing. In our
work, we show that XQuery's approach is good for database applications in that
any generic database application can be modeled in XML using unambiguous
types.

The rest of the paper is organized as follows. In the next section, we present
the EReX conceptual model; we first present the ER model, give our reasons
for choosing the ER model, and describe our extensions to it. In Section 3, we
study different structural and constraint specification schemes for XML, and
present XGrammar, a grammar-based notation for XML schema languages such
as DTD, XML-Schema and RELAX-NG. In Section 4, we study how to translate
an EReX schema to XGrammar, and study the characteristics of the resulting
XML schemas. We conclude with interesting open research problems in Section 5.

2 EReX Conceptual Model 

The Entity Relationship (ER) model, defined by Chen in the 1970s, has been
widely used as a conceptual model ever since [7]. A schema is represented in
the ER model using a diagrammatic notation called an ER diagram. In this
section, we examine the features of the ER model, and our extensions.

2.1 Basic Features of ER 

In the ER model, we specify structures such as entity types, relationship types,
and attributes. We denote an entity type by $E_i$, an entity instance (also
called entity, for short) of an entity type $E_i$ as $e_i$, and the set of
entities of an entity type in a database instance as $I(E_i)$. Figure 1 shows
an entity type Student; $I(Student) = \{s1, s2, s3\}$. A relationship type is
denoted as $R_i$, and it represents an association between entity types. We
denote a relationship instance (also called relationship) of relationship type
$R_i$ as $r_i$, and the set of relationships in a database instance as
$I(R_i)$. Consider a relationship type $R_i$ between entity types
$E_1, E_2, \ldots, E_n$. $I(R_i)$ defines an n-ary relation between the sets
$I(E_1), I(E_2), \ldots, I(E_n)$. A relationship $r \in I(R_i)$ that associates
entities $e_1 \in I(E_1), e_2 \in I(E_2), \ldots, e_n \in I(E_n)$ is also
represented as $(e_1, e_2, \ldots, e_n)$. Figure 1 shows a binary relationship
type AdvisedBy between entity types Student and Professor. Figure 1 (c) shows
$I(AdvisedBy) = \{r1, r2\}$, where $r1 = (s1, p1)$ and $r2 = (s2, p1)$.

An entity type or a relationship type may also define attributes. Consider an
attribute $A$ of an entity type $E$ (or relationship type $R$). $A$ maps values
of $I(E)$ (or $I(R)$) to “values” of $A$. In Figure 1, entity type Student
defines two attributes, snumber and sname; snumber maps $s1 \to 1$,
$s2 \to 2$, $s3 \to 3$; attribute sname maps $s1 \to$ Dave and $s3 \to$ Greg.
Note that sname does not map s2 to any value. The relationship type AdvisedBy
also defines an attribute, project.




[Figure 1: (a) Example ER schema; (b) I(Student) and the attributes;
(c) I(AdvisedBy)]

Fig. 1. Example ER schema and a corresponding database instance fragment.



We focus on two kinds of constraints that can be specified in the ER model:
key constraints and cardinality constraints. Key constraints are specified for
an entity type; we say the key for an entity type $E_i$ is its attribute
$A_{i_1}$ if, for any database instance, $A_{i_1}$ is a one-to-one function
from $I(E_i)$ to values of $A_{i_1}$. The key for an entity type may be
composite (that is, multiple attributes). The key for an entity type $E_k$ is
the attributes $(A_{k_1}, \ldots, A_{k_n})$ if, for any database instance, we
have a one-to-one function from $I(E_k)$ to the values in
$A_{k_1} \times A_{k_2} \times \ldots \times A_{k_n}$. In Figure 1 (a), the key
for Student is (snumber), and the key for Professor is (pname, dept).
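As a minimal sketch of the key-constraint definition, assuming that an instance is stored as a Python dictionary from entity identifiers to partial attribute maps: the helper name `is_key` and the identifier `p2` are hypothetical, and the attribute values are taken from Figure 1 and Table 1. Note that the definition quantifies over every database instance, whereas the code can only test one given instance.

```python
# Toy instance in the spirit of Figure 1. Attribute maps are partial:
# sname is not defined for s2. The identifier p2 is a hypothetical name.
students = {
    "s1": {"snumber": 1, "sname": "Dave"},
    "s2": {"snumber": 2},
    "s3": {"snumber": 3, "sname": "Greg"},
}
professors = {
    "p1": {"pname": "John", "dept": "CS", "office": "140 FL"},
    "p2": {"pname": "Dave", "dept": "Math", "office": "230 FL"},
}


def is_key(instance, attrs):
    """Return True if attrs is defined on every entity of the instance
    and no two entities agree on all of the listed attributes."""
    seen = set()
    for values in instance.values():
        if any(a not in values for a in attrs):
            return False  # not a total function on I(E)
        combo = tuple(values[a] for a in attrs)
        if combo in seen:
            return False  # not one-to-one
        seen.add(combo)
    return True
```

With this data, `is_key(students, ["snumber"])` holds, while `["sname"]` fails because s2 has no sname value; the composite key (pname, dept) holds for the two professors.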

Cardinality constraints are specified for each entity type in a relationship
type. We say that the cardinality constraint for an entity type $E$ in a
relationship type $R$ is $(min, max)$ if, for any database instance, any
$e \in I(E)$ must appear in at least $min$ instances of $I(R)$, and no
$e \in I(E)$ appears in more than $max$ instances of $I(R)$. We denote
unbounded cardinality by *. Based on the cardinality constraints, binary
relationship types can be called 1 : 1, 1 : n, or m : n [3].
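The (min, max) definition can be checked on a single instance by counting how often each entity occurs in the relationship set. The following is only a sketch under assumptions: relationships are stored as tuples, `position` selects the participating entity type within the tuple, `None` stands for the unbounded cardinality *, and the names `satisfies_cardinality` and `p2` are hypothetical.

```python
from collections import Counter


def satisfies_cardinality(entities, relationships, position, min_count, max_count=None):
    """Check a (min, max) cardinality constraint on one instance.

    position selects the entity type inside each relationship tuple;
    max_count=None stands for the unbounded cardinality '*'.
    """
    counts = Counter(r[position] for r in relationships)
    for e in entities:
        n = counts[e]  # 0 for entities that take part in no relationship
        if n < min_count:
            return False
        if max_count is not None and n > max_count:
            return False
    return True


# Instance of Figure 1: r1 = (s1, p1), r2 = (s2, p1); p2 is hypothetical.
students = {"s1", "s2", "s3"}
professors = {"p1", "p2"}
advised_by = {("s1", "p1"), ("s2", "p1")}
```

On this instance, Student satisfies (0, 1) but not (1, 1), since s3 has no advisor, and Professor satisfies (0, *) but not (0, 1), since p1 advises two students.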
ER also has the notion of roles. An entity type $E$ in a relationship type $R$
can be said to play a role, denoted $l$. For a database instance, we define
$I(l) \subseteq I(E)$ as the set of entities that appear in $R$ in the role
$l$. In Figure 1, Student plays the role of advisee, and Professor the role of
advisor in AdvisedBy; $I(advisee) = \{s1, s2\}$, and $I(advisor) = \{p1\}$.

2.2 ER Model as Basis for Studying XML Requirements 

There are several good reasons for studying structural and constraint
specification requirements for XML based on the ER model. First, the ER model
has always been considered representative of real world database applications.
Because of this, algorithms to translate an ER schema to schemas in different
logical models have been studied extensively. A simple algorithm to translate
an ER schema to a relational schema, as given in [11], yields the relations
shown in Table 1.



Table 1. Relational Instance corresponding to the ER schema in Figure 1

Student
snumber | sname
1       | Dave
2       | null
3       | Greg

Professor
pname | dept | office
John  | CS   | 140 FL
Dave  | Math | 230 FL

AdvisedBy
snumber | pname | dept | project
1       | John  | CS   | DB1
2       | John  | CS   | DB2



The above relational instance also shows that the ER model gives a clear
interpretation of some basic concepts in database design, such as nulls,
compared to other models such as the relational model. Another important
concept in database design is normalization. Normalization can also be
explained using the ER model [7] as follows: if we assume that any functional
dependency $A \to B$ implies that there is an entity type with $A$ as the key
and attribute(s) $B$, then a “correct” ER schema and a “correct” translation
algorithm will generate a relational schema guaranteed to have no redundancy
(or, according to [2], the entropy of any position in any instance is
non-zero). In short, the resulting relational schema is in BCNF.

There has been previous work to study XML requirements from other models.
In [13], the authors study in detail how a relational schema can be translated
to an XML schema, and study especially the structural specification
requirements for XML. However, because of the features of the relational model,
the XML schemas generated have no union types and no recursive types, and are
“local tree grammars” such as DTD. Translations from UML and ORM are studied
in [17] and [4]; however, they have the same issue: the resulting XML schemas
do not use union types. Further, these translation algorithms are largely ad
hoc, and it is difficult to formalize the characteristics of the resulting XML
schemas. In this work, we will study how to make use of XML features such as
union types, which are important to real world applications; for example, we
often need to model that an address has either a city and a state, or a zip.
2.3 Extensions to the ER Model

The ER model was originally influenced by the relational model, and lacks
features necessary to make use of features in XML. We therefore extend the ER
model, and call this extended model the EReX model. The extensions are: (a) a
structural specification called categories, (b) constraint specifications
called coverage constraints and order constraints.



Categories. Categories are a kind of relationship type similar to ISA
relationship types. Given entity types $E, E_1, E_2, \ldots, E_n$, we can
specify $E_1, E_2, \ldots, E_n$ as categories of $E$ if, in any database
instance, $I(E_i) \subseteq I(E)$ for every $1 \le i \le n$. We represent this
in an EReX schema with arrows from the $E_i$'s to $E$. This is different from
the ISA relationship type, as well as from the ECR model [9]: (a) ISA requires
that a key constraint be specified for $E$; (b) ECR [9] requires
$I(E) \subseteq I(E_1) \cup I(E_2) \cup \ldots \cup I(E_n)$, and allows
$I(E_i) \not\subseteq I(E)$. These constraints are not imposed on the
categories in the EReX model. Two examples of categories are shown in Figure 2.
Note that in Figure 2 (b), no key constraint is specified for Article; in
Figure 2 (a), there can be instances of Person who are not instances of
PersonCity or PersonZip.




[Figure 2: (a) PersonCity and PersonZip are categories of Person;
(b) Book and Paper are categories of Article]

Fig. 2. Example Categories.



Coverage Constraints. Coverage constraints are specified on entity types or
roles. There are two kinds of coverage constraints. (a) Total coverage: given
entity types $E, E_1, E_2, \ldots, E_n$, where every $E_i$, $1 \le i \le n$, is
a category of $E$, we specify that $E_1 \cup E_2 \cup \ldots \cup E_n = E$ if,
for any database instance,
$I(E_1) \cup I(E_2) \cup \ldots \cup I(E_n) = I(E)$. We can specify total
coverage on roles as: given entity type $E$ and roles $l_1, l_2, \ldots, l_n$,
where every $l_i$, $1 \le i \le n$, is a role played by $E$ in some
relationship type, we specify $l_1 \cup l_2 \cup \ldots \cup l_n = E$ if, for
any database instance, $I(l_1) \cup I(l_2) \cup \ldots \cup I(l_n) = I(E)$.
(b) Exclusive coverage: given entity types $E, E_1, E_2$, where $E_1, E_2$ are
categories of $E$, we specify $E_1 \cap E_2 = \emptyset$ if, for any database
instance, $I(E_1) \cap I(E_2) = \emptyset$. Similarly, given entity type $E$
and roles $l_1, l_2$, where $l_1, l_2$ are roles played by $E$, we specify
$l_1 \cap l_2 = \emptyset$ if, for any database instance,
$I(l_1) \cap I(l_2) = \emptyset$. Examples of coverage constraints are shown
in Figure 3. Coverage constraints have been studied previously in [9].
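The category and coverage definitions are plain set conditions, so they can be illustrated with Python sets. This is a sketch under assumptions: the function names and the entity identifiers are hypothetical, and each check applies to one database instance rather than to all instances, as the definitions require.

```python
def is_category(sub_instances, super_instances):
    """Category condition: I(E_i) is a subset of I(E)."""
    return sub_instances <= super_instances


def total_coverage(parts, whole):
    """Total coverage: the union of the I(E_i) (or I(l_i)) equals I(E)."""
    return set().union(*parts) == whole


def exclusive_coverage(first, second):
    """Exclusive coverage: the two sets have an empty intersection."""
    return first.isdisjoint(second)


# Hypothetical instance for Person and its categories PersonCity and PersonZip.
person = {"n1", "n2", "n3"}
person_city = {"n1"}
person_zip = {"n2"}
```

Here both categories are subsets of Person and are mutually exclusive, but total coverage fails because n3 belongs to neither category; adding n3 to `person_zip` would satisfy it.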
[Figure 3: (a) Coverage constraints:
$\{personBook \cap personPaper = \emptyset,
personBook \cup personPaper = Person\}$;
(b) Coverage constraints:
$\{confPaper \cap journalPaper = \emptyset,
confPaper \cup journalPaper = Paper\}$;
(c) Coverage constraints:
$\{PersonCity \cap PersonZip = \emptyset,
PersonCity \cup PersonZip = Person\}$;
(d) Relationship type with order constraints. The books authored by a person
are ordered.]

Fig. 3. (a) and (b) show coverage constraints on roles; (c) shows coverage
constraints on entity types; (d) shows order constraints.



Order Constraints. Order constraints are specified for entity types in a
relationship type. Consider a relationship type $R$ between entity types
$E_1, E_2, \ldots, E_n$. We specify that $E_i$, $1 \le i \le n$, is ordered in
$R$ if, in any database instance, the relationship instances in $I(R)$ in which
any $e \in I(E_i)$ appears are ordered. We represent this constraint in an EReX
schema with a thick line between $E_i$ and $R$. See Figure 3 (d).



3 XGrammar 

We specify an XML schema using XGrammar, a tree-grammar-based notation for
XML schema languages such as DTD, XML-Schema and RELAX-NG, motivated
by [14]. In this section, we define XGrammar and discuss different structural
and constraint specification schemes for XML. We use $G$ to denote a schema in
XGrammar. We assume the existence of a set $\hat{N}$ of non-terminal symbols, a
set $\hat{E}$ of element names, a set $\hat{A}$ of attribute names, and a set
$\hat{\tau}$ of atomic data types as defined by [19], such as string, integer,
ID, IDREF(S) etc. We will consider only the atomic data types ID, IDREF and
IDREFS. We use the following notations for regular expressions: $\epsilon$
denotes the empty string, “+” denotes union, “,” denotes concatenation,
“$a?$” denotes “$a + \epsilon$”, “$a^*$” denotes Kleene star, and “$a^+$”
denotes “$a, a^*$”.

A schema in XGrammar is denoted by a 6-tuple $G = (N, E, A, S, P, \Sigma)$,
where

- $N$ is a finite set of non-terminal symbols (also called types), where
  $N \subseteq \hat{N}$.

- $E$ is a finite set of element names, where $E \subseteq \hat{E}$.

- $A$ is a finite set of attribute names, where $A \subseteq \hat{A}$. We
  define a type $\tau \in \hat{\tau}$ for any attribute $a \in A$.

- $S$ is a set of start symbols, where $S \subseteq N$.

- $P$ is a set of production rules of the form $X \to x(RE)$, where $X \in N$,
  $x \in E$, and $RE$ is a regular expression:
  $RE ::= \epsilon \mid \tau \mid @a \mid Y \mid (RE + RE) \mid (RE, RE) \mid
  (RE)? \mid (RE)^* \mid (RE)^+$, where $\tau \in \hat{\tau}$, $a \in A$,
  $Y \in N$.

- $\Sigma$ is the set of constraints, which is defined later in this section.

An example XGrammar schema and an instance document are given in Table 2.
Type assignment (interpretation) assigns a “valid” type to each element
in an instance [14]. Given an instance, we denote by $I(N_i)$, where
$N_i \in N$, the set of elements in the instance that are assigned type $N_i$.
In Table 2, we have $I(Person) = \{p1, p2\}$. We say that the type assignment
for an instance is unambiguous if the instance has only one possible type
assignment. The type assignment for the instance in Table 2 is unambiguous.

Before we define constraints, let us define path expressions for an XGrammar
schema $G$. A path expression $p$ is given by:
$p ::= x \mid @a \mid parent::x \mid p/p$, where $x \in E$ and $a \in A$. The
semantics of a path expression is defined by XPath [20]. We let $PE$ denote the
set of path expressions.

We can specify three kinds of constraints in $\Sigma$ for an XGrammar schema
$G$:

- IDREF constraints. Consider attribute $a \in A$. If $a$ is of type IDREF, we
  define the “target type” of $a$ as $a :: IDREF \leadsto RE_1$, where
  $RE_1 ::= X \mid (RE_1 + RE_1)$. If $a$ is of type IDREFS, we define the
  “target types” of $a$ as $a :: IDREFS \leadsto RE_2$, where
  $RE_2 ::= \epsilon \mid X \mid (RE_2 + RE_2) \mid (RE_2, RE_2) \mid
  (RE_2)? \mid (RE_2)^* \mid (RE_2)^+$. Here $X \in N$. The constraint holds
  if, for any instance, the value of $a$ refers to type(s) that “conform” to
  $RE_1$ (or $RE_2$).

- Key constraints. We specify a key constraint as:
  $key(X) = (p_1, p_2, \ldots, p_n)$, where $X \in N$, and $p_i \in PE$,
  $1 \le i \le n$. The semantics is as specified in [10].

- Foreign-key constraints. We specify a foreign key constraint as:
  $X(p_1, p_2, \ldots, p_n)$ REFERENCES $Y(q_1, q_2, \ldots, q_n)$; here
  $X, Y \in N$, and $p_i, q_i \in PE$, $1 \le i \le n$. The semantics is as
  specified in [10].



3.1 Structural Specification Schemes for XML 

Table 2. Example XGrammar schema and an instance XML document

(a) Example XGrammar Schema

$N = \{Root, Person, Book, Paper, Review\}$
$E = \{root, person, book, paper, review\}$
$A = \{name, city, state, zip, BID, btitle, ISBN, PID, ptitle, ARef, rating\}$
$S = \{Root\}$
$P = \{Root \to root(Person^*),
  Person \to person(@name, ((@city, @state) + @zip),
                    (Book^+ + Paper^+), Review^+),
  Book \to book(@btitle, @ISBN, @BID),
  Paper \to paper(@ptitle, @PID),
  Review \to review(@ARef, @rating)\}$
$\Sigma$ = IDREF constraints:
  $\{ARef :: IDREF \leadsto (Book + Paper)\}$
  Key constraints:
  $\{key(Person) = (@name), key(Book) = (@ISBN), key(Paper) = (@ptitle),
  key(Review) = (parent::person/@name, @ARef)\}$

(b) Instance XML document

<root>
  <person name="N1" city="C1" state="S1">
    <book btitle="T1" ISBN="I1" BID="B1"/>
    <book btitle="T2" ISBN="I2" BID="B2"/>
    <review ARef="P1" rating="9"/>
    <review ARef="B2" rating="9"/>
  </person>
  <person name="N2" zip="90095">
    <paper ptitle="T3" PID="P1"/>
    <review ARef="B1" rating="9"/>
    <review ARef="B2" rating="10"/>
  </person>
</root>

(c) Tree representation of (b)

Before we study the different structural specification schemes for XML, let us
define competing non-terminal symbols in an XGrammar schema $G$. Two different
non-terminal symbols $X, Y \in N$ are said to be competing if there exist two
production rules in $P$ such that $X \to x(RE_i)$ and $Y \to x(RE_j)$, where
$x \in E$. DTD is a local tree grammar and imposes the restriction that in a
schema, there should be no competing non-terminal symbols. XML-Schema is a
single-type tree grammar, and imposes the restriction that in any production
rule $X \to x(RE)$, there are no competing non-terminal symbols in $RE$.
RELAX-NG is a regular tree grammar, and imposes no restrictions. Local and
single-type tree grammars guarantee unambiguous type assignment for any
instance of an XGrammar schema, whereas regular tree grammars do not give this
guarantee [14].
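The three classes can be told apart mechanically once the competing non-terminal symbols are known. The sketch below is an illustration under assumptions, not part of the XGrammar definition: each production rule is reduced to its element name and the set of non-terminals mentioned in its content model, the function names are hypothetical, and only the restriction stated above is checked (no condition on start symbols is tested).

```python
from itertools import combinations


def competing_pairs(rules):
    """rules maps a non-terminal to (element name, set of non-terminals in RE).
    Two different non-terminals compete if they share an element name."""
    return {
        frozenset((x, y))
        for x, y in combinations(rules, 2)
        if rules[x][0] == rules[y][0]
    }


def classify(rules):
    """Return the most restrictive class the rule set falls into."""
    pairs = competing_pairs(rules)
    if not pairs:
        return "local"
    for _, content in rules.values():
        if any(pair <= content for pair in pairs):
            return "regular"  # competing symbols inside one content model
    return "single-type"


# Rules of Table 2: every non-terminal has its own element name.
table_2 = {
    "Root": ("root", {"Person"}),
    "Person": ("person", {"Book", "Paper", "Review"}),
    "Book": ("book", set()),
    "Paper": ("paper", set()),
    "Review": ("review", set()),
}

# Hypothetical variants: two types share the element name "person".
shared_name = {
    "Root": ("root", {"Person"}),
    "Person": ("person", set()),
    "PersonZip": ("person", set()),
}
shared_in_one_rule = dict(shared_name, Root=("root", {"Person", "PersonZip"}))
```

The Table 2 rules come out as a local tree grammar. The first variant is single-type because the two competing types never occur in the same content model, and the second variant is only a regular tree grammar.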



3.2 Constraint Specification Schemes for XML 

Constraint specification schemes for XML differ in whether keys and foreign
keys are specified for types or for path expressions. In XGrammar and in [10],
they are specified for types; in XML-Schema [19] and in [6], they are specified
for path expressions. If keys are specified for types, we need unambiguous type
assignment; otherwise, we do not need unambiguous type assignment.
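To illustrate keys specified on types, the following sketch checks two of the key constraints of Table 2 against the instance document of Table 2, using Python's standard `xml.etree.ElementTree` module. It relies on an assumption that holds for this schema only: every element name belongs to exactly one type, so elements can be typed by their tag. The helper `unique` is hypothetical, and the self-closing tags are a reconstruction of the instance document.

```python
import xml.etree.ElementTree as ET

# Instance document of Table 2 (b).
DOC = """
<root>
  <person name="N1" city="C1" state="S1">
    <book btitle="T1" ISBN="I1" BID="B1"/>
    <book btitle="T2" ISBN="I2" BID="B2"/>
    <review ARef="P1" rating="9"/>
    <review ARef="B2" rating="9"/>
  </person>
  <person name="N2" zip="90095">
    <paper ptitle="T3" PID="P1"/>
    <review ARef="B1" rating="9"/>
    <review ARef="B2" rating="10"/>
  </person>
</root>
"""
root = ET.fromstring(DOC)


def unique(values):
    """True if no value occurs twice."""
    values = list(values)
    return len(values) == len(set(values))


# key(Book) = (@ISBN)
book_keys = [b.get("ISBN") for b in root.iter("book")]

# key(Review) = (parent::person/@name, @ARef): the parent step is resolved
# by iterating over person elements and their review children.
review_keys = [
    (p.get("name"), r.get("ARef"))
    for p in root.iter("person")
    for r in p.findall("review")
]
```

Both key value lists are duplicate-free for this instance, whereas the rating attribute alone would not identify a review.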



4 Translating an EReX Schema to an XGrammar Schema 

In this section, we will describe an algorithm to translate an EReX schema to
an XML schema in XGrammar. Compared to the translation algorithms from
ER schema to relational schema [11], we have to consider the additional
features of EReX, and we also have more options for representing an EReX
feature in XGrammar. We will denote the input EReX schema as $\mathcal{E}$, and
the output XGrammar schema as $G = (N, E, A, S, P, \Sigma)$. As an example, we
shall consider the EReX schema shown in Figure 4.




[Figure 4: EReX schema diagram. Coverage constraints:
$\{personBook \cap personPaper = \emptyset,
personBook \cup personPaper = Person,
PersonCity \cap PersonZip = \emptyset,
PersonCity \cup PersonZip = Person,
Book \cup Paper = Article\}$]

Fig. 4. Example EReX schema.



Intuition behind our algorithm. The main intuition behind our algorithm
is to make use of the different coverage constraints to come up with an XML
schema that has only the required number of constructs. For example, consider
the coverage constraint $personBook \cap personPaper = \emptyset$; this says
that a person who wrote books did not write papers. If the books written by a
person and the papers written by a person are captured by having books and
papers as “subelements” of person, then we may specify this coverage
constraint as: $Person \to person(Book^* + Paper^*)$. Similarly, we observe in
Figure 4 that Article has no attributes, so we try to see if we can come up
with an XML schema with no non-terminal symbol corresponding to Article.

We will present our algorithm in three steps: in the first step (initialization 
step), we look at the entity types and their attributes, and come up with a set 
of non-terminal symbols, element names, attribute names, and production rules. 
In the second step, we look at the relationship types, and capture them. In the 
third step, we try to capture coverage constraints, and order constraints. 

4.1 Initialization 

As part of initialization, we do the following: 

1. Define a non-terminal symbol in $N$ corresponding to every entity type $E_i$
   in $\mathcal{E}$. Also define a new non-terminal symbol $Root$, which will
   be the start symbol. For the example in Figure 4, we get
   $N = \{Root, Person, PersonCity, PersonZip, Article, Book, Paper\}$.

2. Define element names as follows: define an element name $root$; define an
   element name corresponding to every entity type $E_i$ that has a key
   constraint or that is not a category of any other entity type $E_k$. For
   Figure 4, we get $E = \{root, person, article, book, paper\}$.

   We then associate each non-terminal symbol in $N$ with an element name
   in $E$ as follows. Let function $t : N \to E$ represent this association.
   $t$ is defined as: $t(Root) = root$. Let $N_i \in N$ correspond to entity
   type $E_i$ in $\mathcal{E}$. If there exists $e_i \in E$ corresponding to
   $E_i$, then $t(N_i) = e_i$; otherwise $t(N_i) = t(N_j)$, where $N_j \in N$
   corresponds to $E_j$ and $E_i$ is a category of $E_j$. For example, consider
   $PersonCity \in N$; it is a category of Person, and hence we will define
   $t(PersonCity) = person$. For Figure 4, we also get $t(Root) = root$,
   $t(Person) = person$, $t(PersonZip) = person$, $t(Article) = article$,
   $t(Book) = book$, $t(Paper) = paper$.

3. Define an attribute name in $A$ corresponding to an attribute of any entity
   type or relationship type in $\mathcal{E}$. For our example, we get
   $A = \{name, city, state, zip, ISBN, btitle, ptitle, rating\}$.

4. Define production rules corresponding to every non-terminal symbol
   $N_i \in N$ as: $N_i \to n_i(RE_i)$, where $n_i = t(N_i)$, and $RE_i$
   includes the attributes in $A$ corresponding to the attributes of $E_i$ in
   $\mathcal{E}$. The production rules we obtain are shown in Table 3.

5. Define key constraints as: for every key constraint in $\mathcal{E}$ that
   says the key for $E_i$ is $(a_1, a_2, \ldots, a_n)$, we define a key
   constraint in $\Sigma$, $key(N_i) = (@a_1, @a_2, \ldots, @a_n)$. The key
   constraints we obtain are shown in Table 3.
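The initialization steps above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' implementation: the `EntityType` record and the `initialize` function are hypothetical, element names are assumed to be the lower-cased entity type names, attributes of relationship types (such as rating in step 3) are left out, and the Figure 4 entity types are entered from the description in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityType:
    name: str
    attrs: tuple = ()
    key: tuple = ()                    # empty tuple: no key constraint
    category_of: Optional[str] = None  # entity type this one is a category of


def initialize(entity_types):
    by_name = {e.name: e for e in entity_types}

    def has_own_element(e):
        # Step 2: a key constraint, or not a category of another entity type.
        return bool(e.key) or e.category_of is None

    def t(name):
        # Association of non-terminals with element names.
        if name == "Root":
            return "root"
        e = by_name[name]
        return e.name.lower() if has_own_element(e) else t(e.category_of)

    N = ["Root"] + [e.name for e in entity_types]                    # step 1
    E = ["root"] + [e.name.lower() for e in entity_types if has_own_element(e)]
    P = {"Root": ("root", [])}                                       # step 4
    for e in entity_types:
        P[e.name] = (t(e.name), ["@" + a for a in e.attrs])
    keys = {e.name: tuple("@" + a for a in e.key)                    # step 5
            for e in entity_types if e.key}
    return N, E, P, keys


FIGURE_4 = [
    EntityType("Person", ("name",), key=("name",)),
    EntityType("PersonCity", ("city", "state"), category_of="Person"),
    EntityType("PersonZip", ("zip",), category_of="Person"),
    EntityType("Article"),
    EntityType("Book", ("ISBN", "btitle"), key=("ISBN",), category_of="Article"),
    EntityType("Paper", ("ptitle",), key=("ptitle",), category_of="Article"),
]
N, E, P, keys = initialize(FIGURE_4)
```

An empty attribute list in `P` plays the role of the empty content model $\epsilon$, so the result corresponds to the production rules and key constraints listed in Table 3.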



Table 3. Example XGrammar after initialization

$N = \{Root, Person, PersonCity, PersonZip, Article, Book, Paper\}$
$E = \{root, person, article, book, paper\}$
$A = \{name, city, state, zip, ISBN, btitle, ptitle, rating\}$
$S = \{Root\}$
$P = \{Root \to root(\epsilon), Person \to person(@name),
  PersonCity \to person(@city, @state), PersonZip \to person(@zip),
  Article \to article(\epsilon), Book \to book(@ISBN, @btitle),
  Paper \to paper(@ptitle)\}$
$\Sigma$ = Key constraints:
  $\{key(Person) = (@name), key(Book) = (@ISBN), key(Paper) = (@ptitle)\}$



From now on, we will use the following notation: $N_i \in N$ corresponds to
$E_i \in \mathcal{E}$, has production rule $N_i \to n_i(RE_i)$, and key
constraint (if any) $key(N_i) = (p_{i_1}, p_{i_2}, \ldots, p_{i_{k_i}})$. Also,
for any regular expression $RE$, $(RE)^1$ denotes $(RE)$.

We define a method $addID(N_i)$, to add an ID attribute for non-terminal $N_i$.
This method checks if there exists an attribute of type ID in $RE_i$; if yes,
then nothing is done; if no, then we set $A = A \cup \{N_iID\}$, where $N_iID$
is of type ID; also $RE_i$ is set as $RE_i = (RE_i, @N_iID)$.

Similarly, we define a method $addRef(N_j, N_i)$, to add an IDREF attribute
referring to $N_i$ for the non-terminal $N_j$. This method sets
$A = A \cup \{N_iRef\}$, where $N_iRef$ is of type IDREF, and adds an IDREF
constraint to $\Sigma$ as
$\Sigma = \Sigma \cup \{N_iRef :: IDREF \leadsto N_i\}$.
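The two helper methods can be sketched as follows. Everything about the representation is an assumption made for illustration: the grammar is a dictionary with the hypothetical keys `A`, `types`, `P` and `idref`, content models are flat lists of terms, and attribute names are built from the full non-terminal name (the text abbreviates them, for example AID and ARef for Article). The type given to rating is also assumed.

```python
def add_id(grammar, n_i):
    """addID(N_i): add an ID attribute unless the content model has one."""
    content = grammar["P"][n_i][1]
    for item in content:
        if item.startswith("@") and grammar["types"].get(item[1:]) == "ID":
            return  # an ID attribute already exists; nothing to do
    attr = n_i + "ID"
    grammar["A"].add(attr)
    grammar["types"][attr] = "ID"
    content.append("@" + attr)


def add_ref(grammar, n_j, n_i):
    """addRef(N_j, N_i): declare an IDREF attribute whose target type is N_i.
    As in the text, appending the attribute to RE_j is left to the caller,
    so n_j is kept only to mirror the signature."""
    attr = n_i + "Ref"
    grammar["A"].add(attr)
    grammar["types"][attr] = "IDREF"
    grammar["idref"][attr] = n_i
    return attr


grammar = {
    "A": {"rating"},
    "types": {"rating": "string"},
    "P": {"Article": ("article", []), "Review": ("review", ["@rating"])},
    "idref": {},
}
add_id(grammar, "Article")
add_id(grammar, "Article")  # the second call changes nothing
ref = add_ref(grammar, "Review", "Article")
grammar["P"]["Review"][1].insert(0, "@" + ref)
```

After these calls, Article carries a single ID attribute, and Review carries an IDREF attribute whose target type is recorded as Article.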

4.2 Translating Relationship Types 

Consider the relationship type $R$ between entity types
$E_1, E_2, \ldots, E_n$, with attributes $a_1, a_2, \ldots, a_m$. Such
relationship types are translated as follows:

1. If $R$ is a binary relationship type ($n = 2$), where the cardinality of one
   of the participating entity types, say $E_2$, is (1, 1), $R$ is translated
   as:

   - If $N_2$ does not appear in the “right hand side” (RHS) of any production
     rule, then rewrite the production rules for $N_1$ and $N_2$ by setting
     $RE_1 = (RE_1, N_2^o)$, $RE_2 = (RE_2, @a_1, @a_2, \ldots, @a_m)$. Here
     $o$ depends on the cardinality of $E_1$ in $R$: if this cardinality is
     (0, 1), $o = ?$; if (1, 1), $o = 1$; if (0, *), $o = *$; if (1, *),
     $o = +$.

   - If $N_2$ appears in the RHS of some rule, then use IDREF as:
     $addID(N_1)$; $addRef(N_2, N_1)$;
     $RE_2 = (RE_2, @N_1Ref, @a_1, @a_2, \ldots, @a_m)$.

   In Figure 4, we have two relationship types where the cardinality of one of
   the entity types is (1, 1): AuthoredBook and AuthoredPaper. These get
   represented as $Person \to person(@name, Book^*, Paper^*)$.

2. If $R$ is a binary recursive relationship type ($n = 2$, $E_1 = E_2$), where
   one of the cardinalities is (0, 1), translate $R$ as: if $N_1$ does not
   appear in the RHS of any production rule, then set
   $RE_1 = (RE_1, N_1^o, (@a_1, @a_2, \ldots, @a_m)?)$; here $o$ can be ? or *,
   depending on the other cardinality for $R$. If $N_1$ appears in the RHS of
   some rule, then represent it as in Step 3.

3. If $R$ is any binary relationship type ($n = 2$), where the cardinality of
   $E_2$ is (0, 1) and Steps 1 or 2 cannot be applied, then translate $R$ as:
   $addID(N_1)$; $addRef(N_2, N_1)$;
   $RE_2 = (RE_2, (@N_1Ref, @a_1, @a_2, \ldots, @a_m)?)$.

4. If $R$ is a binary relationship type, and we cannot apply Steps 1, 2 or 3,
   then $R$ is an m : n relationship type. This is translated as: create a new
   non-terminal symbol $N_R$, and set $N = N \cup \{N_R\}$; create a new
   element name $e_R$ and set $E = E \cup \{e_R\}$; set
   $RE_1 = (RE_1, N_R^o)$, where $o$ is * or +; $addID(N_2)$;
   $addRef(N_R, N_2)$; add a new production rule
   $N_R \to e_R(@N_2Ref, @a_1, @a_2, \ldots, @a_m)$. The key for $N_R$ is
   defined as:

   - if $key(N_1)$ is defined, then
     $key(N_R) = (parent::e_1/p_{1_1}, parent::e_1/p_{1_2}, \ldots,
     parent::e_1/p_{1_{k_1}}, @N_2Ref)$.

   - if $key(N_1)$ is not defined, then $addID(N_1)$; let the attribute of type
     ID in $RE_1$ be $N_1ID$. Now,
     $key(N_R) = (parent::e_1/@N_1ID, @N_2Ref)$.

   For Figure 4, Review is an m : n relationship type, and we get:
   $Person \to person(@name, Book^*, Paper^*, Review^+)$,
   $Review \to review(@ARef, @rating)$, $Article \to article(@AID)$,
   $ARef :: IDREF \leadsto (Article)$, and
   $key(Review) = (parent::person/@name, @ARef)$.

5. If $R$ is a non-binary relationship type, create a new non-terminal symbol
   $N_R$, and a new element name $e_R$. Then: $RE_1 = (RE_1, N_R^o)$,
   $o \in \{?, *, +\}$; $addID(N_i)$, $2 \le i \le n$; $addRef(N_R, N_i)$,
   $2 \le i \le n$; add a new production rule
   $N_R \to e_R(@N_2Ref, @N_3Ref, \ldots, @N_nRef, @a_1, @a_2, \ldots, @a_m)$.
   The key for $N_R$ is defined similarly to that for m : n relationship types.
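The first case of Step 1 can be illustrated with a small sketch. The mapping from cardinalities to occurrence operators follows the list above; the remaining details are assumptions: content models are flat lists of terms, the function names are hypothetical, and the cardinality (0, *) of Person in AuthoredBook and AuthoredPaper is inferred from the resulting rule, since Figure 4 itself is not reproduced here.

```python
def occurrence_operator(min_card, max_card):
    """Operator o for the nested non-terminal, from the cardinality of E_1."""
    table = {(0, 1): "?", (1, 1): "1", (0, "*"): "*", (1, "*"): "+"}
    return table[(min_card, max_card)]


def nest(rules, parent, child, cardinality, relationship_attrs=()):
    """First case of Step 1: RE_1 = (RE_1, N_2^o) and
    RE_2 = (RE_2, @a_1, ..., @a_m)."""
    op = occurrence_operator(*cardinality)
    rules[parent].append(child if op == "1" else child + op)
    rules[child].extend("@" + a for a in relationship_attrs)


# Content models after initialization (Table 3), as flat lists.
rules = {
    "Person": ["@name"],
    "Book": ["@ISBN", "@btitle"],
    "Paper": ["@ptitle"],
}
nest(rules, "Person", "Book", (0, "*"))   # AuthoredBook
nest(rules, "Person", "Paper", (0, "*"))  # AuthoredPaper
```

The Person content model then reads @name, Book*, Paper*, in line with the rule given for Figure 4 in Step 1.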
After translating all the relationship types for the example in Figure 4, we
obtain the XGrammar as shown in Table 4.



Table 4. Example XGrammar after translating relationships

$N = \{Root, Person, PersonCity, PersonZip, Article, Book, Paper, Review\}$
$E = \{root, person, article, book, paper, review\}$
$A = \{name, city, state, zip, ISBN, btitle, ptitle, rating, AID, ARef\}$
$S = \{Root\}$
$P = \{Root \to root(\epsilon),
  Person \to person(@name, Book^*, Paper^*, Review^+),
  PersonCity \to person(@city, @state), PersonZip \to person(@zip),
  Article \to article(@AID), Book \to book(@ISBN, @btitle),
  Paper \to paper(@ptitle), Review \to review(@ARef, @rating)\}$
$\Sigma$ = IDREF constraints: $\{ARef :: IDREF \leadsto Article\}$
  Key constraints: $\{key(Person) = (@name), key(Book) = (@ISBN),
  key(Paper) = (@ptitle), key(Review) = (parent::person/@name, @ARef)\}$



4.3 Other EReX Features 

1. Order constraints are represented as: let $R$ be an n-ary relationship type
   as in the previous subsection; let $E_1$ in $R$ be ordered. $R$ would have
   been represented in one of two ways: (a) $N_2$ (or $N_R$) is in the
   production rule of $N_1$; (b) $N_2$ (or $N_R$) has IDREF attribute
   $N_1Ref$. In (a), order is already captured; in (b) the order is captured
   by: $RE_2 = (RE_2, @N_1Order)$, or $RE_R = (RE_R, @N_1Order)$, as the case
   may be. The two order constraints in Figure 4 are captured in
   $Person \to person(@name, Book^*, Paper^*, Review^+)$.

2. Removing unnecessary non-terminal symbols is done as:

   - Remove any non-terminal $N_i \in N$, where $RE_i = \epsilon$, and $N_i$
     does not appear on the RHS of any production rule.

   - Remove any non-terminal $N_i$ for which $key(N_i)$ is not defined, there
     exist entity types $E_1, E_2, \ldots, E_n$ that are categories of $E_i$,
     and we have the constraint $E_1 \cup E_2 \cup \ldots \cup E_n = E_i$. Set
     $RE_k = (RE_k, RE_i)$, for $1 \le k \le n$. If $N_i$ appears in the RHS of
     some production rule: if no $N_k$, $1 \le k \le n$, appears in the RHS of
     any production rule, then $N_i$ is replaced with
     $(N_1 + N_2 + \ldots + N_n)$; otherwise, we replace $N_i$ with a new
     attribute $a$; $A = A \cup \{a\}$;
     $a :: IDREF \leadsto (N_1 + N_2 + \ldots + N_n)$; $addID(N_k)$, for
     $1 \le k \le n$. If $N_i$ appears as the target type of some IDREF
     attribute, then $N_i$ is replaced with $(N_1 + N_2 + \ldots + N_n)$;
     $addID(N_k)$, for $1 \le k \le n$. For Figure 4, we remove the
     non-terminal Article, and get: $ARef :: IDREF \leadsto (Book + Paper)$;
     $Book \to book(@ISBN, @btitle, @BID)$, $Paper \to paper(@ptitle, @PID)$.

   - Remove any non-terminal $N_i$ for which $key(N_i)$ is not defined, $N_i$
     does not appear in the RHS of any production rule, $RE_i$ does not have
     $N_iID$, and $E_i$ is a category of $E_j$, as: $RE_j = (RE_j, RE_i?)$. For
     Figure 4, we remove PersonCity and PersonZip and get
     $Person \to person(@name, (@city, @state)?, @zip?, Book^*, Paper^*,
     Review^+)$.
3. Not all exclusive coverage constraints can be captured in XGrammar.
   Represent exclusive coverage constraints as: for roles (or categories)
   $l_1, l_2, \ldots, l_n$ of $E_k$, if $l_i \cap l_j = \emptyset$ for
   $1 \le i < j \le n$, and
   $N_k \to e_k(r, Q_1^{o_1}, Q_2^{o_2}, \ldots, Q_n^{o_n})$, where
   $Q_i^{o_i}$ “corresponds” to $l_i$, and the $o_i$'s are in $\{?, *\}$,
   rewrite the rule as:
   $N_k \to e_k(r, (Q_1^{o_1} + Q_2^{o_2} + \ldots + Q_n^{o_n}))$. For
   Figure 4, using $PersonCity \cap PersonZip = \emptyset$ and using
   $personBook \cap personPaper = \emptyset$, we get
   $Person \to person(@name, ((@city, @state)? + @zip?), (Book^* + Paper^*),
   Review^+)$.

4. Not all total coverage constraints can be captured in XGrammar. Represent
   total coverage constraints as: for roles (or categories)
   $l_1, l_2, \ldots, l_n$ of $E_k$, if
   $l_1 \cup l_2 \cup \ldots \cup l_n = E_k$, and
   $N_k \to e_k(r_1, (Q_1^{o_1} + Q_2^{o_2} + \ldots + Q_n^{o_n}), r_2)$,
   where $Q_i^{o_i}$ “corresponds” to $l_i$, and the $o_i$'s are in
   $\{?, *\}$, rewrite the rule as:
   $N_k \to e_k(r_1, (Q_1^{o'_1} + Q_2^{o'_2} + \ldots + Q_n^{o'_n}), r_2)$.
   We get $o'_i$ as: if $o_i = ?$, $o'_i = 1$; if $o_i = *$, $o'_i = +$;
   otherwise $o'_i = o_i$. For Figure 4, using
   $PersonCity \cup PersonZip = Person$ and using
   $personBook \cup personPaper = Person$, we get
   $Person \to person(@name, ((@city, @state) + @zip), (Book^+ + Paper^+),
   Review^+)$.

5. Categories are related as: consider two non-terminal symbols
   $N_i, N_j \in N$, such that $E_i$ is a category of $E_j$; call
   $addID(N_j)$; $addRef(N_i, N_j)$; $N_i \to e_i(RE_i, @N_jRef)$.

6. Ensure that $G$ is a single-type tree grammar. For this, check if any
   production rule has $N_i$ and $N_j$ where $e_i = e_j$. In this case,
   introduce two new non-terminal symbols $N'_i, N'_j$; introduce two new
   element names $e'_i, e'_j$; replace $N_i$ with $N'_i$, and $N_j$ with
   $N'_j$, and set: $N'_i \to e'_i(N_i)$ and $N'_j \to e'_j(N_j)$.

7. Find the non-terminal symbols that do not appear on the RHS of any rule.
   Let them be $N_1, N_2, \ldots, N_n$. Set
   $Root \to root(N_1^*, N_2^*, \ldots, N_n^*)$. For Figure 4, we get
   $Root \to root(Person^*)$.



The final XGrammar schema from Figure 4 is given in Table 2. 
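The rewriting performed for exclusive and total coverage constraints (Steps 3 and 4 of this subsection) can be sketched on content-model terms written as strings. This is an illustration under assumptions: terms carry their occurrence operator as a trailing character, the function names are hypothetical, and the operator mapping is the one stated in Step 4 (? becomes 1, * becomes +, anything else is unchanged).

```python
def exclusive(group):
    """Step 3: replace a sequence of terms by their union."""
    return "(" + " + ".join(group) + ")"


def strengthen(term):
    """Step 4 operator mapping: ? becomes 1 (no operator), * becomes +."""
    if term.endswith("?"):
        return term[:-1]
    if term.endswith("*"):
        return term[:-1] + "+"
    return term


def total(group):
    """Steps 3 and 4 combined for a group that is both exclusive and total."""
    return exclusive([strengthen(term) for term in group])


address = ["(@city, @state)?", "@zip?"]  # PersonCity, PersonZip
writings = ["Book*", "Paper*"]           # roles personBook, personPaper
person = ", ".join(["@name", total(address), total(writings), "Review+"])
```

The resulting string for `person` is the content model of Person in the final schema of Table 2.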



4.4 Characteristics of Our Translation Algorithm 

It is important to formalize the characteristics of the XGrammar resulting from
our translation algorithm. We only state the characteristics; the proofs follow
from our algorithm. Let us first look at information preservation. ER to
relational translation loses some cardinality constraints; similarly, when we
use IDREF attributes in XGrammar, we could potentially lose some cardinality
constraints (such as the minimum cardinality). All order constraints specified
in an EReX schema are captured in XGrammar. However, not all exclusive and
total coverage constraints may be captured. It is also important to note that
XGrammar could add some unnecessary order constraints to the schema, which
were not originally present in the EReX schema. For example, the XGrammar in
Table 2 adds that the reviews of a person are ordered. Other than these, all
constraints in the EReX schema are captured in XGrammar, and no additional
constraints get specified in XGrammar.

Let us examine redundancy in terms of redundancy in attribute values. We
see that any attribute value that appears in the EReX instance (such as
$s1 \to$ Dave) is represented exactly once in the XGrammar instance. Therefore,
if there was no redundancy in the EReX schema, the resulting XGrammar schema
also has no redundancy. In other words, if we assume that any functional
dependency $A \to B$ implies that there is an entity type with key $A$ and
attribute(s) $B$, then a “correct” EReX schema will produce a normalized
XGrammar schema (in XNF [1]) by the above algorithm. We can show a much
stronger result, that is, any update to an attribute value in EReX translates
to an update of an attribute value. Similarly, the insertion (or deletion) of
an entity or relationship translates to adding (or deleting) elements and
attributes.

Let us examine other characteristics of the resulting XGrammar schemas. We
see that the resulting XGrammar generates XML instances with maximal height,
and it makes use of union types and recursive types. Further, we assume that
any database application can be modeled in EReX. Therefore, we get our main
result: any database application can be modeled in XGrammar, where structures
are specified as a single-type tree grammar, and constraints are specified on
types.



5 Conclusions and Future Work 



In this work, we studied the problem of designing an XML schema for a given 
application, using conceptual modeling techniques. We came up with the EReX 
conceptual model, and studied an algorithm to translate EReX schemas to XML 
schemas. Our algorithm was able to make use of features such as union types 
and recursive types provided by XML. We further used this approach to study a 
fundamental problem facing the XML database community today: what struc- 
tural and constraint specification schemes are needed in XML for modeling any 
database application. We conclude that any database application can be modeled 
in XML as a single-type tree grammar, with constraints specified on types. 

There are many open research problems in XML data models. For example,
XML has the notion of document order, where all the elements are ordered; we
do not have a corresponding notion in an EReX schema, and do not understand
what global document order means in a real world application. Another
important research issue is that, in order to capture the constraints in EReX
omitted by XGrammar, we might need a first order constraint language similar
to that for the relational model. Reverse engineering XML schemas to EReX
schemas is also important, and it will provide a different framework for
reasoning about problems such as verifying consistency of an XML schema. This
problem is studied in [6,10,2]. However, there exist several open problems in
these areas, and our framework might be helpful to come up with a better
understanding of these problems.
Abstract. We show that it is possible to extend a general-purpose programming 
language with a convenient high-level data-type for manipulating XML docu- 
ments while permitting (1) precise static analysis for guaranteeing validity of the 
constructed XML documents relative to the given DTD schemas, and (2) a run- 
time system where the operations can be performed efficiently. The system, named 
Xact, is based on a notion of immutable XML templates and uses XPath for de- 
constructing documents. A companion paper presents the program analysis; this 
paper focuses on the efficient runtime representation. 



1 Introduction 

There exists a variety of approaches for programming transformations of XML documents. 
Some work in the context of a general-purpose programming language; for 
example, JDOM [17] is a popular package for Java that allows XML documents to 
be manipulated using a tree representation. A benefit of this approach is that the full 
expressive power of the Java language is directly available for defining the transformations. 
Another approach is to use domain-specific languages, such as XSLT [7], which is based 
on notions of templates and pattern matching. This approach often allows more concise 
programs that are easier to write and maintain, but it is difficult to combine it with more 
general computations, access to databases, communication with Web services, etc. 

Our goal is to integrate XML into general-purpose programming languages to make 
XML transformations easier and safer to develop. We propose Xact, 
which integrates XML into Java through a high-level data-type representing immutable 
XML fragments, a runtime system that supports a number of primitive operations on 
such XML fragments, and a static analysis for detecting programming errors related to 
the XML operations. 

The XML fragments in Xact are immutable for two reasons: First, immutability is 
always a judicious design choice (“I would use an immutable whenever I can”, James 
Gosling [26]); and second, immutability is a necessity for devising precise and efficient 
static analyses, in particular, of validity of dynamically constructed XML documents 
relative to the DTD schemas. The Xact system consists of a simple preprocessor, a 
runtime library, and a program analyzer. The main contribution of this paper is the 
description of the Xact runtime system. We present a suitable runtime representation 
for XML templates that efficiently supports the operations in the Xact API. This is 
nontrivial mainly because of the immutability of the data type. The companion paper 
[20] contains a description of the static analysis of Xact programs. 

We first, in Section 2, describe the design of the Xact language and motivate our 
design choices. Section 3 then gives a brief overview of the results from [20] about 
providing static guarantees for XML transformations written in Xact. Section 4 presents 
our runtime system and discusses time complexity of the operations. Finally, in Section 5, 
we evaluate the system by a number of experiments. 

Related work. The most closely related work is that on JDOM [17], XSLT [18], 
XQuery [4], XDuce [16], Xtatic [10], CDuce [2], XOBE [19], XJ [14], Xen [22], and 
HaXml [27]. In comparison, the Xact language is based on a combination of the fol- 
lowing ideas: 

- Xact integrates XML processing into a general-purpose language, rather than being 
a domain-specific language like XSLT or XQuery. 

- It applies a template-based paradigm for constructing XML values (reminiscent of 
that in XSLT but unlike the other systems mentioned above). 

- XML values are immutable (in stark contrast to JDOM, XJ, and Xen). 

- Deconstruction of XML values is based on the XPath language [8] (which is also 
used for similar purposes in XSLT, XQuery, XJ, and optionally also in JDOM). 

- Static guarantees are provided through data-flow analysis, thereby avoiding the 
explicit type annotations that are required in approaches based on type systems. Such 
explicit types can be cumbersome to write and read, and, as noted in [14], explicit 
types for XML values can be too rigid since the individual steps in a sequence 
of operations may temporarily invalidate the data unless permitting only bottom- 
up construction. (JDOM and XSLT provide no similar static guarantees, and the 
remaining alternatives mentioned above use type systems.) 

We refer to the paper [20] for a comprehensive survey of the relation between the 
language design of Xact and other systems. In the present paper, we focus on the 
relation to the runtime model of a few representative alternatives: (1) JDOM is generally 
considered an efficient but rather low-level platform for manipulating XML documents in 
Java. It provides an explicit tree representation of XML documents where nodes include 
parent pointers, which permits upwards traversal but prohibits sharing. (2) XSLT is a 
widely used XML transformation language and many implementations exist. A central 
part of XSLT is the use of XPath for selection and pattern matching, and much effort 
has been put into optimizing XPath processors for use in XSLT and other systems [12]. 
Our implementation of Xact uses an off-the-shelf XPath processor [21] and can hence 
benefit directly from such work. (3) Both Xtatic and CDuce inherit their key features — 
tree processing in a declarative style with regular types and patterns — from XDuce. 
Xtatic works in the context of C# whereas CDuce is a functional language. The paper [11] 
describes runtime representations for Xtatic, where the main challenges are immutability 
(as for Xact), efficient pattern matching (where we apply XPath instead), and DOM 
interoperability (using techniques that we could also apply). Since no implementation 
of Xtatic has been available to us, we choose the tuned implementation of CDuce as a 
representative for these systems for quantitative comparisons. 
2 The Xact Language 

Compared to other XML transformation languages, Xact is designed to be a small 
sublanguage that can be described in just a few pages. The Xact language introduces 
XML transformation facilities into the Java programming language such that XML documents, 
from a programmer’s perspective, are first-class values on equal terms with 
basic values, such as booleans, integers, and strings. Programmers can thereby combine 
the flexibility and power of a general-purpose programming language with the ability to 
express XML manipulations at a high level of abstraction. This combination is conve- 
nient for many typical transformation tasks. Examples are transformations that rely on 
communication with databases and complex transformation tasks, which may involve 
advanced control-flow depending on the document structure. In these cases, one can 
apply Xact operations while utilizing Java libraries, for example, the sorting facilities, 
string manipulations, and HTTP communication. We choose to build upon Java because 
it is widely used and a good representative for the capabilities of modern general-purpose 
programming languages. Additionally, it is often used as a foundation for Web services, 
using for example Servlets or SOAP, which involve dynamic construction of XHTML 
documents or manipulation of SOAP messages. 

We build XML documents from templates as known from the JWIG language [6]. 
This approach originates from MAWL [1] and <bigwig> [5], and was later refined in 
JWIG, where it has proven to be a powerful formalism for XHTML document construction 
in Web services. Our aim has been to extend the formalism to general XML 
transformations where both construction and deconstruction are supported. 

A template is a well-formed XML fragment containing named gaps: template gaps 
occur in place of elements, and attribute gaps occur in place of attributes. The core 
notation for templates is given by xml in the following grammar: 

    xml  := str                      (character data)
          | <name atts>xml</name>    (element)
          | <[g]>                    (template gap)
          | xml xml                  (template sequencing)

    atts := name="value"             (attribute)
          | name=[g]                 (attribute gap)
          | ε                        (empty sequence)
          | atts atts                (attribute sequencing)

Here, str denotes a string of XML character data, name denotes a qualified XML name, 
g denotes a gap name, and value denotes an XML attribute value. As an example, the 
following XML template, which can be useful when constructing XHTML documents, 
contains two template gaps named TITLE and MAIN and one attribute gap named COL: 

    <html>
      <head><title><[TITLE]></title></head>
      <body bgcolor=[COL]><[MAIN]></body>
    </html>

Construction of a larger template from a smaller one is accomplished by plugging 
values into its gaps. The result is the template with all gaps of a given name replaced by 
values. This mechanism is flexible because complex templates can be built and reused 
many times. Gaps can be plugged in any order; construction is not restricted to be 
bottom-up, in contrast to, for example, XDuce and XOBE. 

Table 1. The central methods in the XML class of Xact. 

    static XML constant(String s)    - creates a template from the constant string s
    String toString()                - returns the textual representation of this template
    boolean equals(Object o)         - determines equality of this template and o
    int hashCode()                   - returns the hash code of this template
    XML plug(Gap g, XML x)           - inserts x into all g gaps in this template
    XML plug(Gap g, String s)        - as the previous operation, but for a string
    XML plug(Gap g, XML[] xs)        - inserts the entries in xs into the g gaps in this template
    XML plug(Gap g, String[] ss)     - as the previous operation, but for string entries
    XML[] select(XPath p)            - returns the array of subtemplates hit by p
    XML gapify(XPath p, Gap g)       - replaces all subtemplates hit by p by g gaps
    XML close()                      - returns this template with all gaps removed
    XML cast(DTD d)                  - runtime check for validity
    XML analyze(DTD d)               - compile-time check for validity
    static XML smash(XML[] xs)       - merges the entries of xs into a single template
    static XML get(String s, DTD d)  - creates a template from a non-constant string

Deconstruction of XML data is also supported in Xact. An off-the-shelf language 
for addressing nodes within XML trees is available, namely W3C’s XPath language [8]. 
XPath is widely used and has, despite its simplicity, proven to be versatile in existing 
technologies, such as XSLT and XQuery. The Xact deconstruction mechanism is also 
based on XPath. We have identified two basic deconstruction operations, which are 
powerful in combination with plugging. The first is select, which returns the subtemplates 
addressed by an XPath expression. The second is gapify, which replaces the subtemplates 
addressed by an XPath expression with gaps. Select is convenient because it permits us 
to pick subtemplates for further processing. Gapify permits us to dynamically introduce 
gaps, which is important for a task such as performing minor modifications in an XML 
tree. Altogether, this constitutes an algebra over templates, which allows typical XML 
manipulations to be expressed at a high level of abstraction. 

We have chosen a value-based programming model as in pure functional languages. 
In this model, XML templates are unchangeable values and operations have no side- 
effects. A Java class that implements the value-based model is said to be immutable. 
Such classes are favored because their instances are safe to share, value factories can 
safely return the same instances multiple times, and thread-safety is guaranteed [3]. All 
Java value classes, such as Integer and String, are for these reasons immutable. Our 
templates inherit the properties and benefit by being easier to use and less prone to error 
than mutable frameworks, such as JDOM. Furthermore, immutability is a necessity for 
useful analysis, as described in Section 3. 

The immutable Java class XML, which represents templates, has the methods shown 
in Table 1. All parameters of type Gap, XPath, and DTD are assumed to be constants and 
may be written as strings. The DTD parameters are URIs of DTDs. 

Xact distinguishes between two different sources of XML data: constant templates 
and input data. Constant templates are part of the transformation program and are con- 
structed using the constant method. The syntax for these templates is the one given by 
the grammar above. Input to the program is read using the get method, which constructs 
a gap-free template from a non-constant string and checks the result for validity with 
respect to the given DTD schema. Output from the transformation is achieved through 
the toString method, which returns the string representation of the XML template. 

Templates can be combined by the plug method, which is overloaded to accept a 
template, a string, or arrays of these as second parameter. Invoking the non-array variants 
will plug the given string or template into all occurrences of the given gap name. The 
array variants will, in document order, plug all occurrences of the given gap name with 
entries from the given array. If the array has superfluous entries these will be ignored, 
and conversely, the empty string will be plugged into superfluous gaps. An exception is 
thrown if one attempts to plug a template into an attribute gap. 
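
The plug rules above can be tried out on a small toy model. The following is only a sketch and not the Xact implementation: the class MiniTemplate, its factory methods, and the choice of IllegalArgumentException are all assumptions made for illustration, and a template is simplified to a flat, immutable sequence of text pieces and gaps (elements are not modeled, and the array-of-templates variant is omitted).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model (hypothetical): an immutable sequence of text pieces and named gaps. */
final class MiniTemplate {
    /** One item: character data (gap == null) or a gap with the given name. */
    private static final class Item {
        final String text;
        final String gap;
        final boolean attributeGap; // true if the gap stands in place of an attribute value
        Item(String text, String gap, boolean attributeGap) {
            this.text = text;
            this.gap = gap;
            this.attributeGap = attributeGap;
        }
    }

    private final List<Item> items;

    private MiniTemplate(List<Item> items) {
        this.items = Collections.unmodifiableList(items);
    }

    private static MiniTemplate single(Item item) {
        List<Item> list = new ArrayList<Item>();
        list.add(item);
        return new MiniTemplate(list);
    }

    static MiniTemplate text(String s) { return single(new Item(s, null, false)); }
    static MiniTemplate gap(String name) { return single(new Item(null, name, false)); }
    static MiniTemplate attributeGap(String name) { return single(new Item(null, name, true)); }

    /** Template sequencing; neither operand is modified. */
    MiniTemplate then(MiniTemplate next) {
        List<Item> list = new ArrayList<Item>(items);
        list.addAll(next.items);
        return new MiniTemplate(list);
    }

    /** Non-array plug: x goes into every g gap; attribute gaps reject templates. */
    MiniTemplate plug(String g, MiniTemplate x) {
        List<Item> out = new ArrayList<Item>();
        for (Item it : items) {
            if (!g.equals(it.gap)) {
                out.add(it);
            } else if (it.attributeGap) {
                throw new IllegalArgumentException("template plugged into attribute gap " + g);
            } else {
                out.addAll(x.items);
            }
        }
        return new MiniTemplate(out);
    }

    /** Non-array plug of a string into every g gap. */
    MiniTemplate plug(String g, String s) {
        List<Item> out = new ArrayList<Item>();
        for (Item it : items) {
            out.add(g.equals(it.gap) ? new Item(s, null, false) : it);
        }
        return new MiniTemplate(out);
    }

    /** Array plug: gaps are filled in document order; extra entries are ignored,
     *  and extra gaps receive the empty string. */
    MiniTemplate plug(String g, String[] ss) {
        List<Item> out = new ArrayList<Item>();
        int next = 0;
        for (Item it : items) {
            if (g.equals(it.gap)) {
                out.add(new Item(next < ss.length ? ss[next++] : "", null, false));
            } else {
                out.add(it);
            }
        }
        return new MiniTemplate(out);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Item it : items) {
            if (it.gap == null) sb.append(it.text);
            else if (it.attributeGap) sb.append('[').append(it.gap).append(']');
            else sb.append("<[").append(it.gap).append("]>");
        }
        return sb.toString();
    }
}
```

With a template holding two ITEM gaps, plugging the one-element array {"a"} fills the first gap with a and the second with the empty string, while the original template value stays unchanged.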

Template deconstruction is provided by the select and gapify methods. Both 
methods take an XPath expression as parameter, which on evaluation returns a set of 
nodes within the given template¹. Invoking the select method gives an array containing 
all the subtemplates rooted at nodes in the XPath evaluation result. The gapify method 
returns a template where all subtemplates rooted at nodes in the XPath evaluation result 
have been replaced by gaps of the given name. 
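
For readers without access to Xact, the behavior of select and gapify can be approximated with the standard JDK DOM and XPath APIs. The sketch below is an analogue under stated assumptions, not the Xact mechanism: the class DomDeconstruction and its methods are hypothetical, copies are made explicitly because DOM trees are mutable, and a gap is represented by a processing instruction named gap, which is purely a convention of this example. The XPath expression is assumed to address element or text nodes, not attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical JDK-only analogue of the two deconstruction operations. */
final class DomDeconstruction {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse input", e);
        }
    }

    /** Evaluates the XPath expression and returns the addressed nodes as a list. */
    private static List<Node> evaluate(Node context, String xpath) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, context, XPathConstants.NODESET);
            List<Node> result = new ArrayList<Node>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot evaluate " + xpath, e);
        }
    }

    /** Analogue of select: deep copies of the subtrees rooted at the addressed nodes. */
    static List<Node> select(Document doc, String xpath) {
        List<Node> copies = new ArrayList<Node>();
        for (Node hit : evaluate(doc, xpath)) {
            copies.add(hit.cloneNode(true));
        }
        return copies;
    }

    /** Analogue of gapify: a copy of doc in which every addressed subtree is
     *  replaced by a placeholder; the input document is left untouched. */
    static Document gapify(Document doc, String xpath, String gapName) {
        Document copy = (Document) doc.cloneNode(true);
        for (Node hit : evaluate(copy, xpath)) {
            Node placeholder = copy.createProcessingInstruction("gap", gapName);
            hit.getParentNode().replaceChild(placeholder, hit);
        }
        return copy;
    }
}
```

The explicit cloning in both methods is the price of emulating value semantics on a mutable tree; it is exactly the copying cost that an immutable representation with sharing is meant to avoid.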

The close method eliminates all gaps in a template and is commonly used in com- 
bination with toString. The result will by construction represent a well-formed XML 
document. Invoking the static smash method concatenates the entries of the given template 
array into a single template². The equals method determines equality of XML 
instances, and the hashCode method returns a consistent hash code for an XML instance. 
The ability to compare entire XML templates for equality permits templates to be stored 
in containers as values rather than as objects and can also be useful in the decision logic 
of transformations. In comparison, other systems either do not have an equality primitive 
or compare by reference instead of by value. 

By placing special analyze methods in the code, the compile-time analyzer can be 
instructed to check for validity relative to the given DTDs. This is usually used in con- 
nection with the toString method to analyze validity of the output data. Additionally, 
runtime validation of a template according to a given DTD schema is provided by the 
cast method, which serves a similar purpose for the Xact analysis as the usual cast 
operation does for the type system of Java. 

In order to integrate Xact tightly with the Java language, we provide special syntax 
for template constants. This relieves programmers from tedious and error-prone character 
escaping. A template xml may be written [[xml]], which after character escaping is 
equivalent to XML.constant("xml"). Transformations that use this syntax are desugared 
by a simple preprocessor. Also, a number of useful macros, presented in [20], for 
commonly occurring tasks are provided. For example, the delete macro effectively 
deletes the subtrees addressed by an XPath expression by performing a gapify opera- 
tion with a fresh gap name. Our implementation also contains a mechanism for declaring 
XML namespaces for constant templates and XPath expressions. 



¹ All XPath axes are supported by Xact. Although the paper [20] focuses on the downwards 
axes, the program analyzer is capable of handling all axes. 

² The paper [20] describes a more powerful operation group and defines smash as syntactic 
sugar. We now treat smash as the primitive and express group in terms of smash, select, and 
equals instead. 
Example. We now consider a simple example, originating from [15], where an address 
book is filtered in order to produce a phone list. An address book here consists of an 
addrbook root element that contains a sequence of person elements, each having 
a name, an addr, and an optional tel element as children. The filtration outputs a 
phonelist root element that contains a sequence of person elements, where only 
those having a tel child remain, and with all addr elements eliminated. The following 
method shows how this can be implemented with Xact: 

    XML phonelist(XML book) {
      XML[] persons = book.select("/addrbook/person[tel]");
      XML list = XML.smash(persons).delete("//addr");
      return [[<phonelist><[LIST]></phonelist>]].plug("LIST", list);
    }

We use the select operation to build an array of all person elements that have a tel 
child. Then, the array entries are combined into a single template, all addr elements are 
deleted, and the result is wrapped into a phonelist element³. 

One may additionally wish to sort the phone list alphabetically by name. Java 
has built-in sorting facilities for arrays, so this is accomplished by implementing a 
Comparator class, called PersonComparator, with the following compare method: 

    int compare(Object o1, Object o2) {
      XML x1 = (XML)o1, x2 = (XML)o2;
      String s1 = XML.smash(x1.select("/person/name/text()")).toString();
      String s2 = XML.smash(x2.select("/person/name/text()")).toString();
      return s1.compareTo(s2);
    }

The Xact operations here simply extract the character data to be used in the comparison. 
The phone list can then be sorted by inserting the following line into the phonelist 
method (after the select operation): 

    Arrays.sort(persons, new PersonComparator());

The example shows how Xact integrates XML processing into Java and how a nontrivial 
transformation task can be intuitive to express using Xact. More example programs can 
be found at http://www.brics.dk/Xact/. 



3 Static Guarantees 

Transforming data from one XML language to another can be a quite intricate task, 
even if a high-level programming language is being used. In particular, it can be dif- 
ficult to ensure at compile-time that the output is always valid with respect to a given 
DTD schema. A special property of the design of Xact is that it enables precise static 
analysis for guaranteeing absence of certain programming errors related to XML docu- 
ment manipulation. In the companion paper [20], we present a data-flow analysis that, 
at compile-time, checks the following correctness properties of an Xact program: 

- output validity: each analyze operation is valid in the sense that the XML 
template at this point is guaranteed to be valid relative to the DTD schema; and 

- plug consistency: each plug operation is guaranteed to succeed, that is, templates 
are never plugged into attribute gaps. 

³ With the notion of code gaps, which is included in the syntactic sugar mentioned in [20], the 
last operation can be written more concisely: 
    return [[<phonelist><{list}></phonelist>]]; 

Additionally, the analysis can detect and warn the programmer if the specified gap for 
a plug operation is never present and if an XPath expression in a select or gapify 
operation will never address any nodes. 

Notice that Xact, in contrast to other XML transformation systems that permit static 
guarantees, does not require every XML variable to be explicitly typed with schema 
information. 

The crucial property of Xact that makes the analysis feasible is that the XML 
templates are immutable. Analyzing programs that manipulate mutable data structures 
is known to be difficult [24,23], and the absence of side-effects means that we do not 
have to model the complex aliasing relations that otherwise may arise. 

The analysis is conservative in the sense that it never misses an error, but it might 
report false errors. Our experiments in [20] indicate that the analysis is both precise and 
efficient enough to be practically useful, and that it produces helpful error messages if 
potential errors are detected. 

4 Runtime System 

We have now presented a high-level language for expressing XML transformations 
and briefly explained that the design permits precise static analysis. However, such a 
framework would be of little practical value if the operations could not be performed 
efficiently at runtime. In this section, we present a data structure in the form of a Java 
library addressing this issue. 

To qualify as a suitable representation for XML templates in the Xact framework, 
our data structure must support the following operations: 

- Creation: Given the textual representation of an XML template, we must build the 
structure representing the template. 

- Combination: The plug, close, and smash operations operate directly on XML 
templates and must be supported directly by the data structure. 

- Navigation: The tasks of converting a template to its textual representation, checking 
the template for validity according to a given schema, and evaluating an XPath 
expression on a template all require means for traversing the XML data in various 
ways. In general, we need a mechanism for pointing at a specific node in the XML 
tree. We call such an XML pointer a navigator. It must support operations for moving 
this pointer around the tree. To support all XPath axis evaluations, we must be able 
to move to the first child and first attribute of an element node, the parent and 
next/previous sibling of any tree node, and the next/previous attribute of an attribute 
node. We assume that this is sufficient for the XPath engine being used (for example, 
Jaxen [21] satisfies this). 

- Extraction: The result of evaluating an XPath expression on the structure, using its 
navigation mechanism, is a set of navigators. From this set of navigators, we must 
be able to obtain the result of the select and gapify operations. 
Fig. 1. The effect of performing the non-array plug operation, c = a.plug(g, b). Part (i) shows 
the two templates, a and b, where a contains two g gaps. Part (ii) shows the naive approach 
for representing c, where everything has been copied. Part (iii) shows the basic approach from 
Section 4.1 where only the paths in a that lead to g gaps are copied and new edges are added 
pointing to the root of b. Part (iv) shows the lazy approach from Section 4.2 where a plug node is 
generated for recording the fact that b has been plugged into the g gaps of a. When the structure 
in (iv) is later normalized, the one in (iii) is obtained. 

A naive data structure that trivially supports all of these operations is an explicit XML 
tree with next-sibling, previous-sibling, first-child and parent pointers in all nodes 
(where we encode attributes in the contents sequences). If such a data structure is used, 
we are forced to copy all parts of the operand structures that constitute parts of the result 
in order to adhere to the immutability constraint. The doubly-linked nature of the struc- 
ture prohibits any sharing between individual XML values. The running times for the 
Xact operations on such a structure would thus be at least linear in the size of the result 
for each operation. As we show in the following, we can do better using a specialized 
data structure. 



4.1 The Basic Approach 

The main problem with the doubly-linked tree structure is that it prevents sharing between 
templates. To enable sharing, we use a singly-linked binary tree, that is, a tree with only 
first-child and next-sibling pointers but without the parent and previous-sibling pointers. 
This structure permits sharing as follows: whenever a subtree of an operand occurs as a 
subtree of the result, the corresponding pointer in the result simply points to the original 
operand subtree and thus avoids copying that subtree. 

Recall that, unlike complete XML documents, an XML template does not necessarily 
have a single root element; rather, it can have an arbitrary sequence of elements, character 
data and template gaps, which we will refer to as the top-level nodes. 

To perform a non-array plug operation, a.plug(g, b), we copy just the portion of 
a that is not part of a subtree that will occur unmodified in the result. More precisely, 
this is the tree consisting of the paths from the root of a to all g gaps in a. Any pointer 
that branches out of these paths in the result points back to the corresponding subtree 
of a. If the gap has a next sibling, we will also need to copy the top-level nodes of b, 
since the list of successors for these nodes changes. This representation is depicted in 
Part (iii) of Figure 1. Note that, in general, this operation will create a DAG rather than a 
tree, since multiple occurrences of g in a will result in multiple pointers from the result 
to the root of b. The array plug operation is performed similarly, except that the path 
end pointers point to distinct templates. The close operation duplicates the paths to all 
gaps and removes the gaps from the duplicate. 

To be able to find the paths to the g gaps efficiently, we must have additional in- 
formation in the graph. In each node, we keep a record of which gap names occur in 
the subtree represented by that node. Since typical templates contain only few distinct 
gap names, this gap presence information can often be shared between many nodes 
and will not constitute a large overhead. Combining this information when constructing 
new templates is also straightforward. Now, when a plug operation into g traverses the 
graph looking for g gaps, it simply skips any branch where the gap presence information 
indicates that no g gaps exist. This narrows the search down to the paths from the root to 
the g gaps. Thus, the execution time for a plug operation is proportional to the number 
of nodes that are ancestors of g gaps in a (including preceding siblings because of our 
use of first-child and next-sibling pointers), plus the number of top-level nodes in b times 
the number of g gaps in a. For the array plug operation, the last term simply becomes 
the total number of top-level nodes in the plugged templates. 
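
The path-copying plug and the gap presence information can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual Xact runtime: the class names TNode and PathCopy are hypothetical, attributes are not modeled, each node builds its own gap-name set instead of sharing it with other nodes, and the traversal is recursive.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Immutable node of a singly-linked binary tree (first-child / next-sibling). */
final class TNode {
    enum Kind { ELEMENT, TEXT, GAP }

    final Kind kind;
    final String label;      // element name, character data, or gap name
    final TNode firstChild;  // null if there are no children
    final TNode nextSibling; // null if this is the last sibling
    final Set<String> gaps;  // gap names at this node or reachable via either pointer

    TNode(Kind kind, String label, TNode firstChild, TNode nextSibling) {
        this.kind = kind;
        this.label = label;
        this.firstChild = firstChild;
        this.nextSibling = nextSibling;
        // Simplification: a fresh set per node; a real system would share equal sets.
        Set<String> g = new HashSet<String>();
        if (kind == Kind.GAP) g.add(label);
        if (firstChild != null) g.addAll(firstChild.gaps);
        if (nextSibling != null) g.addAll(nextSibling.gaps);
        this.gaps = Collections.unmodifiableSet(g);
    }
}

final class PathCopy {
    /** Non-array plug: copies only the paths leading to g gaps, shares the rest. */
    static TNode plug(TNode a, String g, TNode b) {
        if (a == null || !a.gaps.contains(g)) {
            return a; // no g gap in this branch: reuse the operand subtree as-is
        }
        TNode rest = plug(a.nextSibling, g, b);
        if (a.kind == TNode.Kind.GAP && a.label.equals(g)) {
            return append(b, rest);
        }
        return new TNode(a.kind, a.label, plug(a.firstChild, g, b), rest);
    }

    /** Returns b followed by rest. The top-level nodes of b are copied only when
     *  the gap has a next sibling; otherwise b itself is shared. */
    private static TNode append(TNode b, TNode rest) {
        if (b == null) return rest;
        if (rest == null) return b;
        return new TNode(b.kind, b.label, b.firstChild, append(b.nextSibling, rest));
    }

    /** Textual form of a sibling list, for inspection. */
    static String render(TNode n) {
        StringBuilder sb = new StringBuilder();
        for (TNode cur = n; cur != null; cur = cur.nextSibling) {
            switch (cur.kind) {
                case TEXT:
                    sb.append(cur.label);
                    break;
                case GAP:
                    sb.append("<[").append(cur.label).append("]>");
                    break;
                default:
                    sb.append('<').append(cur.label).append('>')
                      .append(render(cur.firstChild))
                      .append("</").append(cur.label).append('>');
            }
        }
        return sb.toString();
    }
}
```

In this sketch, plugging into a gap that does not occur returns the very same node object, and a preceding sibling of a gap is copied while its own children remain shared with the operand.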

Constructing the representation of a tree from its textual representation using the 
constant operation takes time proportional to the size of the tree plus, for each node, 
the number of different gap names that appear in its subtree. The time for converting a 
template to text using toString is proportional to the template size. 

Navigation in this structure is not as straightforward as in the doubly-linked case, 
since navigating backwards with parent or previous sibling requires information that 
is not available in the tree. We can support these directions by letting the navigators 
remember the entire path back to the root, and then backtrack along this path whenever a 
backward step is requested. In other words, we let the navigators contain all the backward 
pointers that the XML structure itself omits. Since navigators are always specific to one 
XML value, we do not restrict sharing by keeping these pointers while the navigator is 
used. Taking any navigator step still takes constant time. 
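
One way to realize such navigators is sketched below, under the assumption that each navigator is an immutable value that remembers the navigators for its parent and its previous sibling; every step then takes constant time. The names BNode and Navigator are hypothetical, and attribute navigation is left out.

```java
/** Minimal shared tree node: only forward pointers, so subtrees can be shared. */
final class BNode {
    final String label;
    final BNode firstChild;
    final BNode nextSibling;

    BNode(String label, BNode firstChild, BNode nextSibling) {
        this.label = label;
        this.firstChild = firstChild;
        this.nextSibling = nextSibling;
    }
}

/** A pointer into one particular tree; it carries the backward pointers that the
 *  nodes themselves omit. */
final class Navigator {
    final BNode node;
    private final Navigator parent;   // navigator at the parent, or null at top level
    private final Navigator previous; // navigator at the previous sibling, or null

    private Navigator(BNode node, Navigator parent, Navigator previous) {
        this.node = node;
        this.parent = parent;
        this.previous = previous;
    }

    /** Starts navigation at the first top-level node of a template. */
    static Navigator at(BNode root) {
        return new Navigator(root, null, null);
    }

    /** Forward steps create a new navigator that remembers where it came from. */
    Navigator firstChild() {
        return node.firstChild == null ? null : new Navigator(node.firstChild, this, null);
    }

    Navigator nextSibling() {
        return node.nextSibling == null ? null : new Navigator(node.nextSibling, parent, this);
    }

    /** Backward steps just return the remembered navigators. */
    Navigator parent() {
        return parent;
    }

    Navigator previousSibling() {
        return previous;
    }
}
```

Because the backward information lives in the navigator, the same child list can hang under two different roots, and a navigator reaching a shared node through either root still reports the correct parent.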

The select operation simply returns a set of pointers to the nodes pointed to by the 
navigators that result from the XPath evaluation. Only the nodes pointed to are copied 
to make sure that their next-sibling pointers are empty. The total time for performing the 
select operation is proportional to the XPath evaluation time. The gapify operation 
first evaluates the XPath expression, resulting in a set of navigators that represent the 
addressed nodes. The tree from the root to these nodes is then copied, as for the plug 
operation. After that, the gap information in the nodes of the new tree is updated in a 
bottom-up traversal to include the new gaps. The total time for performing the gapify 
operation is proportional to the XPath evaluation time, which dominates the time for the 
other steps. Finally, the smash operation can be simulated by making a sequence of gap 
nodes with a fresh gap name and performing an array plug operation into these. This 
takes time proportional to the total number of top-level nodes in the smashed templates. 
These figures may seem satisfactory; however, it turns out that this approach has 
some drawbacks as the following observations reveal. 

- In a sequence of plug operations, each individual plug may create many nodes that 
will be replaced in a subsequent plug operation. If the intermediate results are not 
needed except as arguments for the subsequent plug operations (which is usually 
the case), constructing these nodes is unnecessary and wasteful. For example, a 
common idiom is to use a template <li><[item]></li><[more]> to build a list 
of li elements by repeatedly plugging the template itself into the more gap. Such 
a construction would take quadratic time in the length of the constructed list, since 
all preceding siblings need to be copied each time. 

- Traversing the structure recursively when looking for gaps can lead to unwieldy 
stack sizes, since the ancestor nodes of the gaps include all preceding siblings. 
This problem clearly shows up in practice — the algorithm is unable to handle XML 
documents with more than a few thousand mutual siblings. 
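
The quadratic bound in the first observation follows from a simple sum. As a rough sketch, assume that the k-th plug into the more gap copies on the order of k nodes, namely the preceding li siblings on the path to the gap (the exact constant is an assumption of this note, not a measured figure). Building a list of n items then costs

```latex
\[
  \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; \Theta(n^{2})
\]
```

node copies in total, even though each individual plug only copies the path to a single gap.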

These observations lead us to a further refinement, as explained in the following section. 

4.2 A Lazy Data Structure 

We now present a modification of the basic structure that allows the operations to be 
performed lazily without any reconstruction taking place until explicit traversal of the 
tree is required. This effectively groups plug operations together in a way that permits 
list structures to be built in linear time. 

To accomplish this, we introduce special operation nodes in the graph, each rep- 
resenting a plug or close operation (with smash being simulated by array plug as 
before). We call all other nodes concrete nodes. An operation node has one designated 
child node, which represents the this operand. There are three variants of operation 
nodes, corresponding to the two variants of the plug operation and the close oper- 
ation, respectively; the non-array plug node is labeled with a gap name and has one 
extra edge corresponding to the value being plugged in; similarly, the array plug node 
is labeled with a gap name and an array of extra edges; and the close node has no extra 
information. Intuitively, an operation node merely records the fact that a plug or close 
operation has occurred without actually performing it. Part (iv) of Figure 1 illustrates 
this lazy variant of the plug operation. 

As long as only plug, close and smash operations are performed, the resulting 
template will be represented by a DAG of operation nodes and concrete nodes, where 
all ancestors of operation nodes are themselves operation nodes. When the actual tree 
is needed, we need to unfold this structure into a DAG of only concrete nodes, so that 
the previously described navigation mechanism can be used. We refer to this process of 
eliminating operation nodes as normalization. 

To perform normalization, we traverse the DAG depth-first while keeping track of 
the current plug context. A plug context is a map from gap names to nodes, defined by a 
list of operation nodes. The current plug context is always defined by the list of ancestor 
operation nodes of the current node. A non-array plug node maps its corresponding 
gap name to the root of the plugged template; an array plug node initially maps its gap 
name to the root of the first plugged template; and a close node maps all gap names to a 
special value remove. When more than one operation node mapping from the same gap 
name exists among the ancestors, the one furthest down in the DAG has precedence as it 
corresponds to an earlier plug operation. 

Whenever we encounter a gap node with a name for which there is a mapping in the 
current plug context, we recursively traverse the template rooted at the node targeted 
by the mapping — or remove the gap node in case of the value remove. If the operation 
node is an array plug node, its mapping is changed to the next node on its list, or to the 
empty template if the list is exhausted. Note that the plug context in this traversal of the 
plugged template is defined by the operation node ancestors in the complete DAG, which 
includes the ones in the plugged template plus the ones created after it was plugged. 

Just as for the plug operation in the basic approach, we only traverse the part of 
the DAG which actually needs to be duplicated. That is, we skip any branch where the 
gap presence information indicates that no gap exists for which there is a mapping in 
the current plug context. The only exception to this is the top-level nodes of plugged 
templates, which in general need to be duplicated, as before. 
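
The normalization procedure can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Xact implementation: a concrete template is a tuple of sibling nodes, only the non-array plug and close operations are modeled, gap presence information is omitted (so every element is rebuilt), and plain recursion is used. All class and function names (Plug, Close, Frame, normalize) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Text:
    value: str


@dataclass(frozen=True)
class Gap:
    name: str


@dataclass(frozen=True)
class Element:
    name: str
    children: Tuple["Node", ...] = ()


@dataclass(frozen=True)
class Plug:                      # operation node: records a plug, does not perform it
    this: "Template"
    gap: str
    value: "Template"


@dataclass(frozen=True)
class Close:                     # operation node: records a close
    this: "Template"


Node = Union[Text, Gap, Element]
Template = Union[Tuple[Node, ...], Plug, Close]


@dataclass(frozen=True)
class Frame:                     # one entry of the plug context
    gap: Optional[str]           # None stands for a close node ("remove")
    value: Optional[Template]
    parent: Optional["Frame"]    # context formed by operation nodes further up


def normalize(template, ctx=None):
    """Eliminate operation nodes, returning a tuple of concrete nodes."""
    # Descend through operation nodes; the innermost frame ends up first.
    while isinstance(template, (Plug, Close)):
        if isinstance(template, Plug):
            ctx = Frame(template.gap, template.value, ctx)
        else:
            ctx = Frame(None, None, ctx)
        template = template.this
    out = []
    for node in template:
        out.extend(_expand(node, ctx))
    return tuple(out)


def _expand(node, ctx):
    if isinstance(node, Gap):
        frame = ctx
        while frame is not None:             # furthest-down operation node wins
            if frame.gap is None:
                return ()                    # closed: the gap is removed
            if frame.gap == node.name:
                # The plugged template only sees operation nodes created later.
                return normalize(frame.value, frame.parent)
            frame = frame.parent
        return (node,)                       # no mapping: the gap stays open
    if isinstance(node, Element):
        kids = []
        for child in node.children:
            kids.extend(_expand(child, ctx))
        return (Element(node.name, tuple(kids)),)
    return (node,)
```

A plug is a constant-time operation in this model (it allocates one Plug object), and all copying is deferred to normalize, for example normalize(Close(Plug(t, "more", t))) for a list item template t.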

Following this strategy, we essentially perform a single traversal of the part of the final 
result which could not be shared with any of the constituent templates. Thus, assuming 
that the context lookup operations can be performed in amortized constant time (which 
can be accomplished by caching lookups), and assuming a constant number of distinct 
gap names, the running time for the entire normalization process is proportional to the 
number of newly created nodes. Since this is bounded by the size of the result and the 
result is typically traversed completely anyway, this is a satisfactory result. 

To alleviate the stack requirement of the traversal, we use pointer reversal [25], which 
in essence uses the newly generated nodes as an explicit recursion stack. The recursion 
involved in the recursive unfolding of plugged templates mentioned above is done using 
a separate, explicit stack. Thus, with this strategy, the call stack usage is bounded by a 
constant, and the overall memory requirements are significantly reduced, compared to 
the purely recursive approach. 
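
The effect of avoiding call-stack recursion can be illustrated as follows. Note that this Python sketch uses a separate explicit stack, not pointer reversal itself, and it assumes a first-child/next-sibling node layout; the Node class and both function names are hypothetical.

```python
class Node:
    """Hypothetical node in a first-child/next-sibling representation."""

    def __init__(self, name, first_child=None, next_sibling=None):
        self.name = name
        self.first_child = first_child
        self.next_sibling = next_sibling


def count_nodes(root):
    """Count all nodes reachable from `root` without using call-stack recursion."""
    count = 0
    stack = [root] if root is not None else []
    while stack:
        node = stack.pop()
        count += 1
        # Push the sibling first so that the child subtree is visited first.
        if node.next_sibling is not None:
            stack.append(node.next_sibling)
        if node.first_child is not None:
            stack.append(node.first_child)
    return count


def sibling_chain(n):
    """Build n sibling nodes linked through next_sibling."""
    head = None
    for _ in range(n):
        head = Node("li", next_sibling=head)
    return head
```

A naive recursive traversal of sibling_chain(20000) would need one call frame per sibling, whereas count_nodes keeps the call depth constant.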



4.3 Java Issues 

One of the prominent features of immutable data manipulation is that it works fluently 
in a multi-threaded environment. For this to work properly in the Java implementation, 
care must be taken when the internal state of a representation changes. This happens 
when the result of a normalization replaces the operation nodes — and this is of course 
properly synchronized in the implementation so that no thread will see the data struc- 
ture in an inconsistent state, and no two threads will perform the same normalization 
simultaneously. Note also that the pointer reversal only changes newly created nodes, so 
another thread can traverse (and even normalize) a template sharing parts with the one 
being normalized without causing any problems. 

A ubiquitous Java feature is the ability to compare objects using the equals method. 
This is easily (albeit not very efficiently) done for XML templates by a simple, parallel, 
recursive traversal. To conform to the Java guidelines, any implementation of equals 
must be consistent with the corresponding implementation of the hashCode method. 
To provide this consistency, each node includes a hash code representing the XML tree 
rooted at the node (including following siblings). The hash code for the entire template 
is then the hash code of the leftmost top-level node. This also enables a more efficient 
implementation of equals: whenever two compared subtemplates have different hash 
codes, their equality can be rejected right away. Furthermore, whenever two subtemplates 
originate from the same original subtemplate unmodified, their object identity verifies 
their equality. 
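
The two shortcuts can be sketched as follows. This is an illustrative Python model, not the Java implementation: the class XNode, the field hash_code and the function equal are hypothetical names, and the hash combination is an arbitrary choice.

```python
class XNode:
    """Immutable node whose hash covers its subtree and its following siblings."""

    def __init__(self, label, first_child=None, next_sibling=None):
        self.label = label
        self.first_child = first_child
        self.next_sibling = next_sibling
        child_hash = first_child.hash_code if first_child is not None else 0
        sibling_hash = next_sibling.hash_code if next_sibling is not None else 0
        # Computed once at construction time; nodes are never mutated afterwards.
        self.hash_code = hash((label, child_hash, sibling_hash))


def equal(a, b):
    """Structural equality with identity and hash-code shortcuts."""
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        if x is y:                       # shared subtemplate (or both absent)
            continue
        if x is None or y is None:
            return False
        if x.hash_code != y.hash_code:   # different hashes: cannot be equal
            return False
        if x.label != y.label:           # guards against hash collisions
            return False
        stack.append((x.next_sibling, y.next_sibling))
        stack.append((x.first_child, y.first_child))
    return True
```

Equal trees always receive equal hash codes in this model, which is the consistency requirement between equals and hashCode mentioned above.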

5 Evaluation 

This section describes experiments with our prototype implementation of the Xact 
runtime system. The main goal is to gather runtime performance measurements for a 
range of typical XML transformations in order to compare the performance of Xact 
with that of related systems. Due to the limited space, we can only provide a brief report 
on our evaluation results. 

We have collected a suite of benchmark programs, most of which are inspired by 
XML transformations developed in other languages. A few programs have been developed 
to specifically test the worst-case behavior of our implementation. Altogether the 
suite covers a broad spectrum of typical XML transformation tasks. 

Most of the related technologies mentioned in Section 1 are currently being devel- 
oped by other research teams. Unfortunately, only a few have wished to provide an 
implementation, making it impossible to do a complete performance comparison of all 
the systems. Instead we have picked JDOM, XSLT and CDuce — for which optimized 
runtime systems are available — as good representatives for the different approaches. 
The JDOM and CDuce measurements are obtained using the latest releases (JDOM 
Beta 10 and CDuce 0.1.1). The XSLT measurements are obtained using Apache Xalan 
2.6, which supports the complete XSLT 1.0 language and is among the fastest Java- 
based implementations. For Xact, we use the lazy approach described in Section 4.2. 
All experiments have been executed on a 3.0 GHz Intel Pentium 4 machine with 1 GB 
RAM running Red Hat Linux 9.0 with Sun’s Java 2 SE 1.4.2 and O’Caml 3.0.7. Since 
the focus of this paper is runtime performance we do not measure compilation and type 
checking. Furthermore, the price of parsing input XML documents says little about the 
relative strengths of the implementations, so this cost is excluded from measurements 
in order to give a fair comparison. 

We start by comparing Xact with XSLT using four typical XML transforma- 
tion tasks. Two transformations originate from the XSLTMark benchmark suite [9]: 
Backwards mirrors its input document by reversing the order of all node sequences; 
DBOnerow queries a person database for a single entry and transforms it into XHTML. 
Performance on mixed content documents is compared by Uppercase, which transforms 
all names in an address book into uppercase characters. Phonelist is the example from 
Section 2 transforming an address book into a sorted phone list. The transformations are 
executed on input XML documents of size 100 KB, 1 MB, and 10 MB. 

                    100 KB                  1 MB                     10 MB
              XSLT       Xact        XSLT        Xact         XSLT         Xact
Backwards     551 ms     421 ms      1,615 ms    1,513 ms     15,373 ms    11,599 ms
DBOnerow      279 ms     160 ms      754 ms      274 ms       4,048 ms     994 ms
Uppercase     431 ms     246 ms      1,234 ms    634 ms       8,810 ms     5,365 ms
Phonelist     494 ms     423 ms      1,351 ms    1,799 ms     8,029 ms     21,834 ms

These figures indicate that the performances of the two are roughly similar. The main 
benefits of Xact compared to XSLT are the static guarantees and the possibility of apply- 
ing the full Java language. For example, the Uppercase benchmark is only expressible 
in XSLT because this language contains a built-in function (translate) for mapping 
individual characters to other characters; more advanced character data transformations 
are not possible in XSLT without implementation dependent extension functions. 

Next, we compare Xact with JDOM using the Linkset transformation (Example 
15.8 in [13]), which extracts a set of links from an RDF feed, and the Phonelist 
transformation, which is described above. 

                    100 KB                  1 MB                     10 MB
              JDOM       Xact        JDOM        Xact         JDOM         Xact
Linkset       23 ms      146 ms      128 ms      316 ms       304 ms       1,837 ms
Phonelist     80 ms      422 ms      408 ms      1,799 ms     3,212 ms     21,834 ms

These experiments indicate that the JDOM approach with mutable tree updates and 
purely navigational access, as one would expect, performs better than the immutable 
Xact approach based on XPath. However, this should be weighed against the fact that the 
Xact transformations are both shorter and more readable than the JDOM transformations. 
Furthermore, the Xact transformations are statically type safe in contrast to 
those written with JDOM. 

For the comparison of Xact and CDuce we use our Phonelist transformation and 
the Split transformation, which is a benchmark program developed by the CDuce team 
and used in their performance comparisons. 

                    100 KB                  1 MB                     10 MB
              CDuce      Xact        CDuce       Xact         CDuce        Xact
Phonelist     156 ms     422 ms      1,747 ms    1,799 ms     21,579 ms    21,834 ms
Split         94 ms      496 ms      496 ms      1,729 ms     error        12,897 ms

Since Xact uses Java and CDuce uses O’Caml, the performance is difficult to compare*, 
but on these few benchmarks there seems to be no significant time difference for larger 
data sets. When running the CDuce Split transformation on the 10 MB document, it 
runs out of memory, indicating that the internal XML representation in Xact is 
more compact than the one in CDuce. 

To demonstrate that the lazy approach is preferable to the basic one, we compare 
the two using a benchmark Logging, which extracts statistical information from a web 
server log file and exhibits the quadratic blowup for the basic approach: 

                      Xact (basic)       Xact (lazy)
Logging (100 KB)      709 ms             639 ms
Logging (1 MB)        3,189 ms           1,926 ms
Logging (3 MB)        11,227 ms          3,836 ms
Logging (10 MB)       stack overflow     9,011 ms

These figures show that the lazy approach can lead to significant savings in practice and 
that it scales smoothly to large documents. 
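
The size of the saving can be recomputed directly from the Logging measurements. The short Python calculation below uses only the timings reported in the table; the variable names are arbitrary.

```python
# Logging benchmark timings in milliseconds: (basic, lazy), as reported above.
timings_ms = {
    "100 KB": (709, 639),
    "1 MB": (3189, 1926),
    "3 MB": (11227, 3836),
}

# Ratio of basic to lazy running time for each input size.
speedup = {size: basic / lazy for size, (basic, lazy) in timings_ms.items()}

for size, ratio in speedup.items():
    print(f"{size}: basic/lazy = {ratio:.2f}")
```

The ratio grows from about 1.11 at 100 KB to about 2.93 at 3 MB, which is consistent with the basic approach growing faster than linearly on this benchmark.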

* To exclude parsing time for CDuce, we measured the full time including parsing and then 
subtracted the time for performing the identity transformation. 

In general, we conclude that the runtime system is sufficiently efficient. Our goal 
has not been to outperform the alternative XML transformation systems, but rather to be 
comparable in runtime performance and scalability, which complements the convenient 
language design and static analysis that Xact also provides. 

Obviously, there are ways to improve performance further. We plan to experiment 
with caching of XPath parse trees, handling simple XPath expressions without involving 
the general XPath engine, and compiling XPath expressions to basic navigation steps (as 
also done in the XJ project). Also, we believe that it is possible to exploit the knowledge 
gained from the static analysis for optimizing the evaluation of XPath expressions. 



6 Conclusion 

We have presented an overview of the Xact language, focusing on the runtime system. 
The design of Xact provides high-level primitives for programming XML transforma- 
tions in the context of a general-purpose language, and, as shown in [20], it permits a 
precise static analysis. A special feature of the design is that the data-type is immutable, 
which at the same time is convenient to the programmer and a necessity for precise anal- 
ysis. However, it also makes it nontrivial to construct a runtime system that efficiently 
supports all the Xact operations, which is the main problem being addressed in this 
paper. Our experiments indicate that the runtime system being proposed is sufficiently 
efficient to be practically useful. 

Our prototype implementation, which consists of the runtime system, the desugarer, 
and the static analyzer supporting the full Java language, is available on the Xact home 
page: http://www.brics.dk/Xact/. 

Abstract. Due to their numerous benefits, relational systems play a major role 
in storing XML documents. XML also benefits relational systems by providing 
a means to publish legacy relational data. Consequently, a large volume of XML 
data is stored in and produced from relations. However, relational systems are 
not well-tuned to produce XML data efficiently. This is mainly due to the flat 
nature of relational data as opposed to the tree structure of XML documents. 
In this paper, we argue that relational query optimizers need to incorporate new 
optimization techniques that are better suited for XML. In particular, we explore 
new optimization techniques that enable computation sharing between queries 
that construct sibling elements in the XML tree. Such queries often have large 
common join expressions that can be shared through appropriate rewritings. We 
show experimentally that these rewritings are fundamental when building XML 
documents from relations. 



1 Introduction 

Relational systems are good for XML and XML is good for relations. On the one hand, 
there are mature relational systems that can be used to store the ever growing number of 
XML documents that are being created. On the other hand, XML is a great interface to 
publish and exchange legacy relational data. Consequently, most XML data today comes 
from and ends up in relations. Unfortunately, XML and relational data differ in their very 
nature. One is tree-structured, the other is flat and often normalized. Consequently, the 
performance of relational engines varies considerably when producing XML documents 
from relations. In research, several efforts explored how to build XML documents from 
relations efficiently [2][4][5][7][9]. In particular, the authors in [7] argue for extending 
relational engines to benefit XML queries. In this work, we explore complementary 
optimization techniques that are fundamental to handle XML queries efficiently in a 
relational engine. 

Due to the flat nature of relational data, as opposed to the nested structure of XML, 
generating an XML document from relations often involves evaluating multiple SQL 
queries (possibly as many as the number of nodes in the DTD). These queries often 
contain common sub-expressions in order to build the tree structure. Thus, query per- 
formance can vary considerably, necessitating a cost-based optimization of the plan for 
building XML documents. In [9], the authors explore rewriting-based optimizations be- 
tween a query for a parent node and the queries for its children nodes in a middle-ware 
environment. We argue that sharing computation between queries for sibling nodes, not 
just between parent and children queries, is key to efficiently building XML documents 
from relational data, both in a middle-ware environment (as in [9]) and for the query 
optimizer of a relational system. 

Consider a publishing example where we want to build XML documents from a 
TPC-H database [16]. These documents conform to a DTD with Customer as a root 
element and its two sub-elements SuppName (for suppliers) and PartName (for parts). In 
order to identify the suppliers of a given customer, the TPC-H table, CUSTOMER, is joined 
with ORDERS and LINEITEM. This join expression needs to be joined with the SUPPLIER 
table to compute supplier names (i.e., element SuppName). This same join expression 
(between CUSTOMER, ORDERS and LINEITEM) needs to be joined with the PART table to 
evaluate the set of part names associated with each customer (i.e., PartName). Obviously, 
the sibling queries at the two elements SuppName and PartName share a large common 
join expression and could be merged into a single query (using an outer union) where 
this common join expression is factored out, enabling the relational engine to evaluate 
it only once. In [9], every such query merge has to go through a parent/child merge. For 
example, the two sibling queries at SuppName and PartName can be merged only if they 
are also joined with the query at Customer, resulting in a single query that is used to 
evaluate the whole document. If Customer is a “fat” node (containing multiple attributes 
such as Name, Zip and Phone), the entire customer information will be replicated with 
each SuppName and each PartName (because of outer-joins) which may result in higher 
communication costs (in a middle-ware environment) and computation costs (both in a 
middle-ware environment, and in a relational optimizer). Being able to merge sibling 
queries independently from their parent query and factor out common sub-expressions 
in merged queries is a key rewriting that we will explore when building XML documents 
from relations. 

Our contributions are as follows: 

- We show that sharing computation between sibling queries is a fundamental opti- 
mization for efficient building of XML documents from flat relational data. 

- We describe several query rewritings that exploit shared computation between 
queries used to build an XML document. 

- We design an optimization algorithm that applies our rewritings to find the best set 
of SQL queries that optimizes processing time and achieves a good compromise 
between processing and communication times in a middle-ware environment. 

- We run experiments that compare multiple strategies of sharing common computa- 
tion between sibling queries and identify their considerable performance benefits. 

Section 2 describes our motivating examples and gives a formal definition of our 
problem. Rewriting techniques are presented in Section 3. Section 4 contains the opti- 
mization algorithm and a study of the search space. Experimental results are presented 
in Section 5. Related work is discussed in Section 6. Section 7 concludes. 

2 Motivation and Problem Definition 

2.1 Publishing Legacy Data in XML 

We consider a simplified version of the relational schema of the TPC-H benchmark [16]. 
This schema describes parts ordered by customers and provided by suppliers. 

[Figure 1: DTD drawn as a tree. Node labels recovered from the figure: TPC; Customer (reached through an edge labeled *); Name, Phone, AcctBal, Region, PartName, SuppName; RegName, RegCom.]

Fig. 1. Publishing Legacy Data: Example DTD 



CUSTOMER [C_CUSTKEY, C_NAME, C_PHONE, C_ACCTBAL, C_NATIONKEY] 

NATION [N_NATIONKEY, N_NAME, N_REGIONKEY] 

REGION [R_REGIONKEY, R_NAME, R_COMMENT] 

PART [P_PARTKEY, P_NAME] 

SUPPLIER [S_SUPPKEY, S_NAME, S_NATIONKEY] 

ORDERS [O_ORDERKEY, O_CUSTKEY] 

LINEITEM [L_ORDERKEY, L_PARTKEY, L_SUPPKEY] 

We want to build XML documents that conform to the DTD given in Fig. 1. To keep 
the exposition simple, we represent this DTD as a tree. Edges labeled with a * are 
used for repeated sub-elements. We are interested only in publishing information about 
customers whose account balance is lower than $5000 (predicate p). PartName contains 
the names of the parts ordered by a customer. SuppName contains the names of their 
suppliers. 

It has been shown previously in [5] that it is possible to write a single SQL query 
(containing outer-joins and possibly outer-unions) to build an XML document for a 
relational database regardless of the underlying relational schema. It has also been shown 
in [9] that a set of SQL queries that are equivalent to the single query can be generated. In 
order to better explain performance issues, we examine the case where an SQL query is 
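generated for each element in the DTD as follows, where $C = \sigma_p(\text{CUSTOMER})$, 
$J_1 = C \bowtie_{\text{NATIONKEY}} \text{NATION} \bowtie_{\text{REGIONKEY}} \text{REGION}$ and 
$J_2 = C \bowtie_{\text{CUSTKEY}} \text{ORDERS} \bowtie_{\text{ORDERKEY}} \text{LINEITEM}$: 

\[
\begin{aligned}
Q_{\text{Customer}} &= \pi_{\text{C\_CUSTKEY}}(C) \\
Q_{\text{Name}} &= \pi_{\text{C\_CUSTKEY},\,\text{C\_NAME}}(C) \\
Q_{\text{Phone}} &= \pi_{\text{C\_CUSTKEY},\,\text{C\_PHONE}}(C) \\
Q_{\text{AcctBal}} &= \pi_{\text{C\_CUSTKEY},\,\text{C\_ACCTBAL}}(C) \\
Q_{\text{Region}} &= \pi_{\text{C\_CUSTKEY},\,\text{R\_REGIONKEY}}(J_1) \\
Q_{\text{RegName}} &= \pi_{\text{C\_CUSTKEY},\,\text{R\_REGIONKEY},\,\text{R\_NAME}}(J_1) \\
Q_{\text{RegCom}} &= \pi_{\text{C\_CUSTKEY},\,\text{R\_REGIONKEY},\,\text{R\_COMMENT}}(J_1) \\
Q_{\text{PartName}} &= \pi_{\text{C\_CUSTKEY},\,\text{P\_NAME}}(J_2 \bowtie_{\text{PARTKEY}} \text{PART}) \\
Q_{\text{SuppName}} &= \pi_{\text{C\_CUSTKEY},\,\text{S\_NAME}}(J_2 \bowtie_{\text{SUPPKEY}} \text{SUPPLIER})
\end{aligned}
\]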

Several sibling queries share common expressions. The simplest example is the 
case of $Q_{\text{Name}}$, $Q_{\text{Phone}}$ and $Q_{\text{AcctBal}}$ that are all projections on the same (subset of the) 
CUSTOMER table. This is due to the fact that some fields in this table are used to generate 
sub-elements. Thus, these sibling queries could be merged into a single query $Q_C$. The 
merged query could further be merged with the parent query, $Q_{\text{Customer}}$, resulting in “fat” 
customer nodes as follows: 

\[ Q_C = \pi_{\text{C\_CUSTKEY},\,\text{C\_NAME},\,\text{C\_PHONE},\,\text{C\_ACCTBAL}}(C) \]

The second example involves $Q_{\text{RegName}}$ and $Q_{\text{RegCom}}$ that share a common join 
expression $J_1$. This is due to the fact that nations and regions are normalized into tables 
and that recovering them requires performing joins with these intermediate tables. By 
merging $Q_{\text{RegName}}$ and $Q_{\text{RegCom}}$ into a single query, the common expression is evaluated 
only once: 

\[ Q_{\text{RegNameCom}} = \pi_{\text{C\_CUSTKEY},\,\text{R\_REGIONKEY},\,\text{R\_NAME},\,\text{R\_COMMENT}}(J_1) \]

Furthermore, $Q_{\text{RegNameCom}}$ could be merged with $Q_C$ resulting in: 

\[ \pi_{\text{C\_CUSTKEY},\,\text{C\_NAME},\,\text{C\_PHONE},\,\text{C\_ACCTBAL},\,\text{R\_REGIONKEY},\,\text{R\_NAME},\,\text{R\_COMMENT}}(J_1) \]

The last and most interesting example is the case of the two sibling queries $Q_{\text{PartName}}$ 
and $Q_{\text{SuppName}}$ that share a common join expression $J_2$. In order to share this join 
expression, the two queries could be merged. However, because a customer has 
multiple parts and suppliers, merging $Q_{\text{PartName}}$ and $Q_{\text{SuppName}}$ might result in replication. 
Replication might slow down query processing as well as communication time. There 
are two ways to avoid replication in this case: either the relational optimizer is able 
to optimize outer unions and the queries are rewritten using an outer union where the 
common sub-expression is factored out, or the relational engine is forced to compute 
$J_2$, materialize it and then use it to evaluate the two queries. The query $Q_{\text{PartSupp}}$ that 
results from the outer union is given by: 

\[ \big(\pi_{\text{C\_CUSTKEY},\,\text{P\_NAME},\,\text{NULL}}(J_2 \bowtie_{\text{PARTKEY}} \text{PART})\big) \cup \big(\pi_{\text{C\_CUSTKEY},\,\text{NULL},\,\text{S\_NAME}}(J_2 \bowtie_{\text{SUPPKEY}} \text{SUPPLIER})\big) \]
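
The two sibling queries and their outer-union merge can be tried out on a toy database. The Python sketch below uses the standard sqlite3 module; the sample rows are invented for illustration, and expressing the shared join as a common table expression named J2 is an assumption of this sketch, not the rewriting mechanism of the paper. Whether an engine actually evaluates such a shared expression only once depends on its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (C_CUSTKEY INTEGER, C_NAME TEXT, C_ACCTBAL REAL);
CREATE TABLE ORDERS   (O_ORDERKEY INTEGER, O_CUSTKEY INTEGER);
CREATE TABLE LINEITEM (L_ORDERKEY INTEGER, L_PARTKEY INTEGER, L_SUPPKEY INTEGER);
CREATE TABLE PART     (P_PARTKEY INTEGER, P_NAME TEXT);
CREATE TABLE SUPPLIER (S_SUPPKEY INTEGER, S_NAME TEXT);
-- Invented sample data: only customer 1 has a balance below 5000 and orders.
INSERT INTO CUSTOMER VALUES (1, 'Alice', 1000.0), (2, 'Bob', 9000.0), (3, 'Carol', 200.0);
INSERT INTO ORDERS   VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO LINEITEM VALUES (10, 100, 500), (10, 101, 500), (11, 100, 501), (12, 101, 501);
INSERT INTO PART     VALUES (100, 'bolt'), (101, 'nut');
INSERT INTO SUPPLIER VALUES (500, 'Acme'), (501, 'Globex');
""")

# Common join expression J2 = sigma_p(CUSTOMER) join ORDERS join LINEITEM.
J2 = """
WITH J2 AS (
    SELECT C_CUSTKEY, L_PARTKEY, L_SUPPKEY
    FROM CUSTOMER
    JOIN ORDERS   ON O_CUSTKEY = C_CUSTKEY
    JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
    WHERE C_ACCTBAL < 5000
)
"""

# Sibling queries evaluated separately: each one recomputes J2.
part_rows = conn.execute(
    J2 + "SELECT DISTINCT C_CUSTKEY, P_NAME FROM J2 JOIN PART ON P_PARTKEY = L_PARTKEY"
).fetchall()
supp_rows = conn.execute(
    J2 + "SELECT DISTINCT C_CUSTKEY, S_NAME FROM J2 JOIN SUPPLIER ON S_SUPPKEY = L_SUPPKEY"
).fetchall()

# Merged query: an outer union that pads the missing column with NULL.
merged_rows = conn.execute(
    J2
    + "SELECT C_CUSTKEY, P_NAME, NULL AS S_NAME FROM J2 JOIN PART ON P_PARTKEY = L_PARTKEY "
    + "UNION "
    + "SELECT C_CUSTKEY, NULL, S_NAME FROM J2 JOIN SUPPLIER ON S_SUPPKEY = L_SUPPKEY"
).fetchall()
```

Each row of merged_rows carries either a part name or a supplier name, so the customer key is not replicated across part and supplier combinations as it would be with a join of the two results.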

Because of the presence/absence of some indices, it might not always be the case 
that the merged query $Q_{\text{PartSupp}}$ is cheaper than the sum of $Q_{\text{PartName}}$ and $Q_{\text{SuppName}}$. The 
relational optimizer could choose different plans to evaluate the common join expression 
in the two queries resulting in a better evaluation time than the merged query. 

Finally, using an outer-join, it is possible to rewrite all of the nine queries above 
into a single query. That outer-join guarantees that all customers satisfying predicate 
p will be selected (even if they have never ordered any part). However, if all queries 
are merged, each customer tuple (along with its required fields) will be replicated as 
many times as the number of parts and suppliers for this customer. This replication is 
due to merging queries with their parent query and impacts computation time as well 
as query result size. Therefore, depending on the amount of replication generated by 
merging sibling queries with their parent query, it might be desirable to optimize queries 
at siblings separately from their parent query. 

S. Amer-Yahia, Y. Kotidis, and D. Srivastava 

Fig. 2. Shredded XML: Example DTD 

A key observation is that shared computation between sibling queries, when build- 
ing XML data from relations, is often higher than shared computation between parent 
and children queries making sibling merges more appropriate than merging with the 
parent query. This is particularly true in two cases: (i) if the relational schema is highly 
normalized and thus, several joins involving intermediate relations are needed to com- 
pute sub-elements, and (ii) if the DTD contains repeated sub-elements that might create 
replication (of the parent node) when children and parent queries are merged together. 

2.2 Building XML Documents from Shredded XML 

There have been many efforts to explore different shredding schemas for XML into rela- 
tions [1]. Most of them rely on using key/foreign key relationships to capture document 
structure and thus, generate sibling queries with common sub-expressions when building 
XML documents. 

Fig. 2 contains the DTD of a document that has been shredded and stored in a 
relational database. The example contains information on shows such as their title and 
year as well as the reviews written by a reviewer who has a name and an address. We 
first consider the relational schema Schemal given below: 

SHOW [SHOW_ID, TITLE, YEAR] 

REVIEW [REVIEW_ID, COMMENT, SHOW_ID] 

REVIEWER [REVIEWER_ID, NAME, ADDRESS_ID, REVIEW_ID] 

ADDRESS [ADDRESS_ID, ADDRESS] 

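In order to compute the Name and Address of each reviewer in a review associated 
with a show, the following two queries are needed (where 
$J = \text{REVIEW} \bowtie_{\text{REVIEW\_ID}} \text{REVIEWER}$): 

\[
\begin{aligned}
Q_{\text{Name}} &= \pi_{\text{SHOW\_ID},\,\text{REVIEW\_ID},\,\text{REVIEWER\_ID},\,\text{NAME}}(J) \\
Q_{\text{Address}} &= \pi_{\text{SHOW\_ID},\,\text{REVIEW\_ID},\,\text{REVIEWER\_ID},\,\text{ADDRESS}}(J \bowtie_{\text{ADDRESS\_ID}} \text{ADDRESS})
\end{aligned}
\]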

These queries share a common sub-expression J. If the address of a reviewer was 
inlined inside a review in the relational schema, no redundant computation would have 
occurred. This is illustrated in Schema2 as follows: 

SHOW [SHOW_ID, TITLE, YEAR] 

REVIEW [REVIEW_ID, COMMENT, NAME, ADDRESS, SHOW_ID] 

In this case, $Q_{\text{Name}}$ and $Q_{\text{Address}}$ correspond to simple projections on the REVIEW 
table and could be easily merged into a single query. 

Even when a query workload is used for XML storage in relations, as in [3], our 
optimization techniques for SQL “publishing” queries are still useful for two reasons. 
First, an improved optimizer can provide more accurate cost estimates for the queries in 
the workload, for making cost-based shredding decisions. Second, queries that are not 
in the initial query workload would still need to be optimized in an ad-hoc fashion. 

2.3 Problem Definition 

We are interested in building XML documents from a relational store efficiently. Fol- 
lowing the approaches used in [4] [9], the structure of the resulting XML documents is 
abstracted as a DTD-like labeled tree structure, with element tags serving as node labels, 
and edge labels indicating the multiplicity of a child element under a parent element; it 
is important to note that multiple nodes in this DTD-like structure may have the same 
element tag, due to element sharing, or due to data recursion. 

SQL queries are associated with nodes in this DTD-like structure, and together 
determine the structure and content of the resulting XML documents that are built from 
the relational store; the examples in Section 2.1 are illustrative. The set of SQL queries 
generated for a given XML document might be small. However, the amount of data that 
these queries manipulate can be large and thus, optimization is necessary. If we denote 
by S the set of all SQL queries necessary to build a document conforming to a given 
DTD, then the problem is formulated as follows: 

find the set of SQL queries $S$ such that $\sum_{s \in S} \big(w_p \cdot proc(s) + w_c \cdot comm(s)\big)$ 
is minimized 

where $proc$ (resp. $comm$) is a function that computes the cost of processing (resp. 
communicating the result of) a SQL query and $w_p$, $w_c$ are weights chosen appropriately. 
In a centralized environment, communication cost might not be relevant in which case, 
it could be removed from the cost model. 
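
The objective can be written down directly. In the Python sketch below, the function plan_cost and all cost figures are hypothetical; in the setting described here the estimates would come from the relational optimizer rather than from hand-written tables.

```python
def plan_cost(queries, proc, comm, w_p=1.0, w_c=1.0):
    """Weighted cost of a set of SQL queries: sum of w_p*proc(s) + w_c*comm(s)."""
    return sum(w_p * proc(s) + w_c * comm(s) for s in queries)


# Invented cost estimates, for illustration only.
proc_estimate = {"Q_PartName": 40, "Q_SuppName": 45, "Q_PartSupp": 60}
comm_estimate = {"Q_PartName": 10, "Q_SuppName": 12, "Q_PartSupp": 22}

candidates = {
    "separate": ["Q_PartName", "Q_SuppName"],
    "merged": ["Q_PartSupp"],
}

costs = {
    name: plan_cost(qs, proc_estimate.get, comm_estimate.get, w_p=1.0, w_c=2.0)
    for name, qs in candidates.items()
}
best = min(costs, key=costs.get)
```

Setting w_c to 0 gives the centralized variant in which communication cost is dropped from the model.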

3 Queries and Rewriting Rules 

We explore several possible rewritings that share common computation between queries 
and use the relational optimizer as an oracle to optimize and estimate the cost of individual 
SQL queries. This technique can be used both by a middle-ware environment (as in [9]) 
and to extend a relational optimizer to be able to perform the optimizations we are 
considering. 

3.1 Query Definition 

In order to better explain the rewriting rules we are using, we first define the query 
expressions used to build a single XML document. Sorting is omitted in our queries 
since it does not affect our rewritings. 

Definition 1 [Atomic Node] An atomic node is a node in the DTD which has a unique 
instance for each distinct instance of its parent node. 

By convention, the root of the XML document is an atomic node. Examples of atomic 
nodes are RegName and Name (in the DTD of Fig. 1). Each customer has a single value for 
these nodes. Thus, given a node in the tree, there exists a functional dependency between 
each instance of that node and each corresponding instance of its atomic children nodes. 



Definition 2 [Multiple Node] A multiple node corresponds to a node in the DTD which 
may have multiple instances for each distinct instance of its parent node. 

An example of a multiple node is the PartName node. 

In order to compute instances of an XML node, a unique SQL query is associated 
with that node. The evaluation of that query results in a set of tuples each of which is 
used to create an instance. There is a one-to-one correspondence between the tuples that 
are in the result of a SQL query at a node and the instances of that node. This semantics 
is similar to that of the Skolem functions used in [9]. Therefore, a key (that might be 
composed of attributes coming from different relations) is associated with each node in 
the XML document and is in the set of projected attributes at that node. Each distinct 
value of the key determines a distinct instance of the node to which that key is associated. 

In an XML document, parent/child relationships correspond to key/foreign key joins 
between parent queries and children queries. Thus, in order to build the XML document 
tree structure, the query used at a node must always include the query at its parent node. 
Therefore, queries at nodes are defined as follows. 

Definition 3 [Queries] Given query $Q_p = \pi_p\, \mathit{exp}_p$ at node $p$ ($p$ is the key at node $p$) 
and query $Q_c$ at node $c$, if $c$ is a child of $p$, then $Q_c$ is defined by one of: 

- $Q_c = \pi_{p,c}\, \mathit{exp}_p$. $Q_c$ is a simple projection on the expression used to evaluate the 
parent node $p$. 

- $Q_c = \pi_{p,c}(\mathit{exp}_p \bowtie \mathit{exp}_c)$. $Q_c$ is a projection on a join expression containing the 
parent node expression. 

If the relational schema is highly normalized, a join expression is often necessary in 
computing the instances of XML nodes. In the case of a multiple node, this join is used 
to build a one-to-many relationship from flat relational data. If the relational schema 
contains an un-normalized relation, simple projections often suffice to compute node 
values. 

In the definition given above, $\mathit{exp}_c$ can be any expression that might include an 
arbitrary number of joins. It is necessary that the key value $p$ used to compute the parent 
node $p$ is a subset of the keys of its children nodes. An example is the query used to 
evaluate atomic nodes such as Name. In this query, the customer identifier CUSTKEY is also 
projected out and determines the customer to which each name instance is associated. 
The query used to evaluate the node RegName is an example of the presence of a join 
expression in the child query. In fact, in this particular case, even if a join expression is 
used, it is guaranteed that there is a unique region name per customer. The expression 
at the node PartName is an example of a query used to compute a multiple node. 

Given two queries corresponding to the parent and child elements or to two sibling 
elements, we rewrite them in two steps. First, these queries are merged resulting in a 
single SQL query. Second, common sub-expression elimination is applied to the merged 
query if it contains redundant expressions. We now define query merging and common 
sub-expression elimination in our context. 

3.2 Query Merging 

Definition 4 [Parent/child Merging] Given a node p and its query $Q_p$ and a node c,
which is a child of p, and its query $Q_c$, merging $Q_p$ and $Q_c$ results in a query Q defined
as follows:

1. If $Q_p = \pi_p \mathit{exp}_p$ and $Q_c = \pi_{p,c} \mathit{exp}_p$, then $Q = \pi_{p,c} \mathit{exp}_p$.

2. If $Q_p = \pi_p \mathit{exp}_p$ and $Q_c = \pi_{p,c}(\mathit{exp}_p \bowtie \mathit{exp}_c)$, then
$Q = (\pi_p \mathit{exp}_p) ⟕ (\pi_{p,c}(\mathit{exp}_p \bowtie \mathit{exp}_c))$.

An example of the first merge is the case of merging the Customer query with the 
query at its child node Name. An example of the second one is the case of merging the 
Customer query with its child node PartName. 

Definition 5 [Sibling Merging] Given two sibling nodes $c_1$ and $c_2$ sharing a common
parent p, merging their queries $Q_{c_1}$ and $Q_{c_2}$ results in a query Q defined as follows:

1. If $c_1$ and $c_2$ are both atomic nodes such that $Q_{c_1} = \pi_{p,c_1} \mathit{exp}$ and $Q_{c_2} = \pi_{p,c_2} \mathit{exp}$,
then $Q = \pi_{p,c_1,c_2} \mathit{exp}$.

2. If one of $c_1$ or $c_2$ is a multiple node, then $Q = Q_{c_1} \cup Q_{c_2}$, where $\cup$ is an outer-union.

An example of the first kind of sibling merge is the case of nodes RegName and RegCom.
In this case, the expression that merges the queries at those nodes simply takes the union of the
projection lists of the two initial queries. The second sibling merge case is more general.
An example is the query that results from merging the queries at nodes PartName
and SuppName (see Section 2).

3.3 Exploiting Common Sub-expressions 

Since each query must contain its parent query, the largest expression a parent query 
shares with its children queries is itself. Thus, given two sibling queries, the smallest 
expression these queries have in common is their parent query. Sibling queries could 
have more in common, though. The query obtained from either the parent/child or sibling 
query merging will often contain redundant expressions. 

The first case of parent/child merging (in Definition 4) and the first case of sibling
merging (in Definition 5) rewrite the two input queries in a way that factors out the
common expression between the two queries. The two remaining cases, in the same
definitions, are the cases that will be discussed in this section.

Definition 6 [Parent/child Sharing] The merged query is
$Q = (\pi_p \mathit{exp}_p) ⟕ (\pi_{p,c}(\mathit{exp}_p \bowtie \mathit{exp}_c))$,
where $(\pi_p \mathit{exp}_p)$ is the parent expression, which can be factored
out as follows: $Q = \pi_{p,c}(\mathit{exp}_p ⟕ \mathit{exp}_c)$.

One might think that it is always a good idea to factor out the common parent 
expression after a parent/child merge. Our experiments, in Section 5, show that this 
choice might not always be the best depending on the presence of selection predicates 
in the expressions. 

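Definition 7 [Sibling Sharing] Given two sibling queries $Q_{c_1}$ and $Q_{c_2}$, where $c_2$ is a
multiple node, depending on $c_1$ being atomic or multiple, the merged query Q is defined
by:

1. $Q = (\pi_{p,c_1} \mathit{exp}_p) \cup (\pi_{p,c_2}(\mathit{exp}_p \bowtie \mathit{exp}_{c_2}))$.

2. $Q = (\pi_{p,c_1}(\mathit{exp}_p \bowtie \mathit{exp}_{c_1})) \cup (\pi_{p,c_2}(\mathit{exp}_p \bowtie \mathit{exp}_{c_2}))$.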

The query in the second sibling sharing case is the one that deserves most attention. 
Let us consider the example of merging PartName and SuppName (in the DTD of Fig. 1) 
and write the corresponding queries in SQL. 

Below, we give three versions of the query that merges $Q_{PartName}$ and $Q_{SuppName}$. Q,
MaxQ and MinQ are three equivalent queries where common sub-expressions are treated
differently.

Q

select Q.ckey, Q.sname, Q.pname
from
  -- branch 1: suppliers per customer; part columns padded with NULL
  ((select distinct 1 as L, C_CUSTKEY as ckey, S_SUPPKEY as skey, S_NAME as sname,
                    NULL as pkey, NULL as pname
    from CUSTOMER, SUPPLIER, LINEITEM, ORDERS
    where S_SUPPKEY = L_SUPPKEY and L_ORDERKEY = O_ORDERKEY
      and C_CUSTKEY = O_CUSTKEY and C_ACCTBAL < 5000)
   UNION ALL
  -- branch 2: parts per customer; supplier columns padded with NULL
   (select distinct 2 as L, C_CUSTKEY as ckey,
                    NULL as skey, NULL as sname,
                    P_PARTKEY as pkey, P_NAME as pname
    from CUSTOMER, PART, LINEITEM, ORDERS
    where P_PARTKEY = L_PARTKEY and L_ORDERKEY = O_ORDERKEY
      and C_CUSTKEY = O_CUSTKEY and C_ACCTBAL < 5000)
  ) Q
order by Q.ckey, L;



Q is the "naive" query where $Q_{PartName}$ and $Q_{SuppName}$ are merged using an
outer-union. No common expression elimination is applied to it. Dummy field L is introduced
to separate parts and suppliers for each customer (for the purpose of building the final
XML document).

MaxQ

select distinct C_CUSTKEY as ckey, Q.sname, Q.pname
from ORDERS, CUSTOMER, LINEITEM,
  -- outer-union of all suppliers and all parts
  ((select 1 as L, S_SUPPKEY as skey,
           S_NAME as sname, NULL as pkey, NULL as pname
    from SUPPLIER)
   UNION ALL
   (select 2 as L, NULL as skey, NULL as sname,
           P_PARTKEY as pkey, P_NAME as pname
    from PART)
  ) Q
where C_ACCTBAL < 5000 and C_CUSTKEY = O_CUSTKEY
  and L_ORDERKEY = O_ORDERKEY and (Q.skey = L_SUPPKEY or Q.pkey = L_PARTKEY)
order by C_CUSTKEY, L;



MaxQ is a rewriting of Q where the common join between ORDERS, CUSTOMER and
LINEITEM is factored out. This join is the largest shared expression between the two
siblings. The disjunctive condition (Q.skey = L_SUPPKEY or Q.pkey = L_PARTKEY) results
from the fact that the common join expression needs to be joined with suppliers
using Q.skey = L_SUPPKEY and with parts using Q.pkey = L_PARTKEY. The outer-union
operation now computes all suppliers and parts. *



MinQ

select distinct C_CUSTKEY as ckey, Q.sname, Q.pname
from ORDERS, CUSTOMER,
  -- each branch keeps its own join with LINEITEM
  ((select 1 as L, L_ORDERKEY, S_NAME as sname, NULL as pname
    from SUPPLIER, LINEITEM
    where S_SUPPKEY = L_SUPPKEY)
   UNION ALL
   (select 2 as L, L_ORDERKEY, NULL as sname, P_NAME as pname
    from PART, LINEITEM
    where P_PARTKEY = L_PARTKEY)
  ) Q
where C_ACCTBAL < 5000
  and C_CUSTKEY = O_CUSTKEY and Q.L_ORDERKEY = O_ORDERKEY
order by C_CUSTKEY, Q.L;



MinQ is a variant of MaxQ where the main goal is to avoid disjunctive join condi- 
tions and cope with current day relational optimizers which are unable to deal with 
disjunctive predicates efficiently. However, since MinQ replicates a portion of the com- 
mon join condition (in the example, the one with the LINEITEM table), it might not 
always perform better than MaxQ. Therefore, we explore both rewritings: complete com- 
mon sub-expression elimination which results in unions and may introduce disjunctive 
conditions and partial common sub-expression elimination which results in unions and 
has only conjunctive predicates. 
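
As a sanity check on the three rewritings, the following sketch runs them against a toy in-memory SQLite database and compares the (ckey, sname, pname) rows they return. Everything about the setup is an assumption made for illustration: the table contents are invented, only the columns used by the queries are created, SQLite stands in for the commercial RDBMS, the parentheses around the union branches are dropped because SQLite does not accept them, and the `order by` clauses are omitted. It says nothing about relative performance.

```python
import sqlite3

# Toy data (invented): customers 1 and 3 satisfy C_ACCTBAL < 5000, customer 2 does not.
SETUP = """
create table CUSTOMER (C_CUSTKEY integer, C_ACCTBAL real);
create table ORDERS   (O_ORDERKEY integer, O_CUSTKEY integer);
create table LINEITEM (L_ORDERKEY integer, L_PARTKEY integer, L_SUPPKEY integer);
create table SUPPLIER (S_SUPPKEY integer, S_NAME text);
create table PART     (P_PARTKEY integer, P_NAME text);
insert into CUSTOMER values (1, 100.0), (2, 9000.0), (3, 50.0);
insert into ORDERS   values (10, 1), (11, 1), (20, 2), (30, 3);
insert into LINEITEM values (10, 100, 7), (11, 101, 7), (11, 100, 8),
                            (20, 102, 9), (30, 101, 8);
insert into SUPPLIER values (7, 's7'), (8, 's8'), (9, 's9');
insert into PART     values (100, 'p100'), (101, 'p101'), (102, 'p102');
"""

QUERIES = {
    # Naive outer-union: both branches repeat the CUSTOMER/ORDERS/LINEITEM join.
    "Q": """
        select Q.ckey, Q.sname, Q.pname from
          (select distinct 1 as L, C_CUSTKEY as ckey, S_SUPPKEY as skey,
                  S_NAME as sname, NULL as pkey, NULL as pname
           from CUSTOMER, SUPPLIER, LINEITEM, ORDERS
           where S_SUPPKEY = L_SUPPKEY and L_ORDERKEY = O_ORDERKEY
             and C_CUSTKEY = O_CUSTKEY and C_ACCTBAL < 5000
           union all
           select distinct 2 as L, C_CUSTKEY as ckey, NULL as skey,
                  NULL as sname, P_PARTKEY as pkey, P_NAME as pname
           from CUSTOMER, PART, LINEITEM, ORDERS
           where P_PARTKEY = L_PARTKEY and L_ORDERKEY = O_ORDERKEY
             and C_CUSTKEY = O_CUSTKEY and C_ACCTBAL < 5000) Q
    """,
    # Complete sharing: the common join is evaluated once, at the price of an OR.
    "MaxQ": """
        select distinct C_CUSTKEY as ckey, Q.sname, Q.pname
        from ORDERS, CUSTOMER, LINEITEM,
          (select 1 as L, S_SUPPKEY as skey, S_NAME as sname,
                  NULL as pkey, NULL as pname from SUPPLIER
           union all
           select 2 as L, NULL as skey, NULL as sname,
                  P_PARTKEY as pkey, P_NAME as pname from PART) Q
        where C_ACCTBAL < 5000 and C_CUSTKEY = O_CUSTKEY
          and L_ORDERKEY = O_ORDERKEY
          and (Q.skey = L_SUPPKEY or Q.pkey = L_PARTKEY)
    """,
    # Partial sharing: LINEITEM is joined inside each branch, no disjunction.
    "MinQ": """
        select distinct C_CUSTKEY as ckey, Q.sname, Q.pname
        from ORDERS, CUSTOMER,
          (select 1 as L, L_ORDERKEY as okey, S_NAME as sname, NULL as pname
           from SUPPLIER, LINEITEM where S_SUPPKEY = L_SUPPKEY
           union all
           select 2 as L, L_ORDERKEY as okey, NULL as sname, P_NAME as pname
           from PART, LINEITEM where P_PARTKEY = L_PARTKEY) Q
        where C_ACCTBAL < 5000 and C_CUSTKEY = O_CUSTKEY and Q.okey = O_ORDERKEY
    """,
}


def run_all():
    """Run Q, MaxQ and MinQ on the toy database and return their sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    # Sort by repr so that rows containing NULL (None) can be ordered.
    results = {name: sorted(conn.execute(sql).fetchall(), key=repr)
               for name, sql in QUERIES.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for name, rows in run_all().items():
        print(name, rows)
```

On this data the three queries agree because supplier and part names are unique; with duplicate names, the `distinct` in MaxQ and MinQ could collapse rows that Q keeps apart.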



4 Optimization 

We designed two greedy algorithms: OptimizeSiblings() and OptimizeAll().
OptimizeSiblings() explores merges and computation sharing between sibling
queries only, while OptimizeAll() interleaves merging of sibling queries and merging
of parent/child queries.

4.1 Sibling Optimization 

OptimizeSiblings(), given in Algorithm 2, explores the benefits of sibling merging
for each pair of sibling nodes. The benefit of merging two sibling queries can be either
positive or negative. It is computed as the difference between the processing and
communication costs of the two queries at nodes x and y and the rewritten query (where both
queries are merged) (see Algorithm 1).

* This may not be the case if additional predicates on parts and/or suppliers are used.

Algorithm 1 compute_benefit() Algorithm

Require: x, y
#define cost(q) (w_p*proc(q) + w_c*comm(q))
c_before_merge = cost(x) + cost(y)
c_after_merge = min(cost(Q(x,y)), cost(MaxQ(x,y)), cost(MinQ(x,y))) {pick best sibling merge}
benefit(x,y) = c_before_merge - c_after_merge

Algorithm 2 OptimizeSiblings() Algorithm

Require: Tree
while not_empty(s_list) do
    pick (x,y): max(benefit(x,y)) in s_list {pick most beneficial siblings to merge}
    stop if benefit(x,y) < 0
    sibling_merge_rewrite(x,y) {replace x, y with merged query}
    children(x) += children(y) {y-subtree is attached to x}
    remove (y,*) from s_list
    remove (*,y) from s_list
    compute_benefit(x,*) in s_list
    compute_benefit(*,x) in s_list
end while

Candidate pairs are stored in a list s_list. At each step in the optimization algorithm,
the sibling pair that offers the best benefit (say (x,y)) is selected to be rewritten. The
query at node x now contains the merged expression between x and y. The query at node
y no longer exists. Thus, all pairs of the form (y,*) and (*,y) are removed from the
candidate sibling merges s_list. This includes (x,y), which is no longer a candidate
pair. Finally, since the query expression at node x has been modified, the algorithm
recomputes the benefit of all candidate sibling merges that involve node x (i.e., (x,*)
and (*,x)).

Once two sibling queries are merged, OptimizeSiblings() rewrites them to eliminate
common computation. The algorithm chooses the best of Q, MaxQ and MinQ using
compute_benefit().

If two sibling queries are merged, their children queries become siblings and could
be considered for additional sibling merges. However, these new sibling queries are
less likely to share large common sub-expressions. In addition, considering these
queries for sibling merging would increase the size of the search space. Therefore, as a heuristic,
the only candidate sibling pairs (x,y) we consider are the ones where x and y are siblings
in the initial set of queries. 
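
The greedy loop of Algorithms 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: queries are opaque values, the caller supplies `cost` (standing in for w_p*proc(q) + w_c*comm(q)) and `merge_candidates` (a hypothetical function returning the alternative rewritings, e.g. Q, MaxQ and MinQ), benefits are recomputed from scratch in every round instead of being maintained incrementally in s_list, and the re-attachment of the y-subtree is left out.

```python
from itertools import combinations


def compute_benefit(x, y, cost, merge_candidates):
    """Algorithm 1: cost saved by replacing x and y with their cheapest merge.

    Returns (benefit, cheapest merged query).
    """
    best = min(merge_candidates(x, y), key=cost)
    return cost(x) + cost(y) - cost(best), best


def optimize_siblings(sibling_groups, cost, merge_candidates):
    """Algorithm 2 (greedy): repeatedly apply the most beneficial sibling merge.

    sibling_groups holds one list of queries per parent node of the initial
    tree, so only queries that were siblings initially are ever paired.
    """
    groups = [list(g) for g in sibling_groups]
    while True:
        best = None  # (benefit, group index, x, y, merged query)
        for gi, group in enumerate(groups):
            for x, y in combinations(group, 2):
                benefit, merged = compute_benefit(x, y, cost, merge_candidates)
                if best is None or benefit > best[0]:
                    best = (benefit, gi, x, y, merged)
        # Stop when no pair is left or the best merge would increase the cost.
        if best is None or best[0] < 0:
            return groups
        _, gi, x, y, merged = best
        # x is replaced by the merged query; y disappears from the candidates.
        groups[gi] = [merged if q is x else q for q in groups[gi] if q is not y]


if __name__ == "__main__":
    # Invented costs: merging a and b pays off, any merge involving c does not.
    costs = {"a": 10, "b": 10, "c": 1,
             "a+b": 12, "a+c": 12, "b+c": 12, "a+b+c": 30}
    print(optimize_siblings([["a", "b", "c"]], costs.__getitem__,
                            lambda x, y: [x + "+" + y]))
```

Each round removes one query from a group, so the loop terminates after at most one iteration per initial query.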

4.2 Combined Optimization 

OptimizeAll() is given in Algorithm 3. Once a parent/child or a sibling merge has
been performed, the difference between OptimizeAll() and OptimizeSiblings()
is the impact on pc_list, the list of candidate parent/child merges. Since node y does not
exist anymore, (parent(y), y) needs to be removed from pc_list. In addition, since
the query at node x now contains the merged query, the benefits of (parent(x), x) and
of (x,*) are recomputed. In order to remain within good complexity bounds, the same
assumption as for OptimizeSiblings() is made on sibling merges. In addition, this
assumption is also made for parent/child merges. When a parent/child merge (x,y) is
performed, the subtree rooted at y becomes directly related to x. In this case, we do not
consider the new children of x as candidate merges.

Algorithm 3 OptimizeAll() Algorithm

Require: Tree
while not_empty(list = union(s_list, pc_list)) do
    pick (x,y): max(benefit(x,y)) in list
    stop if benefit(x,y) < 0
    if (x,y) in s_list then
        sibling_merge_rewrite(x,y)
    else
        pc_merge_rewrite(x,y)
    end if
    children(x) += children(y)
    remove (y,*) and (*,y) from s_list
    compute_benefit(x,*) in s_list
    compute_benefit(*,x) in s_list
    remove (parent(y),y) from pc_list
    compute_benefit(parent(x),x) in pc_list
    compute_benefit(x,*) in pc_list
end while

4.3 Cost Analysis 

Given $|S|$ queries, the maximal initial size of pc_list is $|S| - 1 = O(|S|)$ and the maximal
initial size of s_list is $\sum_{q \in S} f(q)(f(q) - 1)/2 = O(|S|^2)$, where $f(q)$ is the fanout of
query q (number of children queries) in the XML tree. The initialization of the two lists
needs $O(|S|^2)$ time and space. At each step where a pair (x, y) is selected, we remove
at least one element from pc_list. For each node q whose children are in s_list there
can be at most $f(q)$ sibling merges, each taking at most $O(f(q))$ time. Therefore, the
number of iterations is linear in the number of queries and each takes linear time. Thus,
the number of steps required is linear in the number of queries: $O(|S|)$, where the initial
number of elements in s_list is at most $O(|S|^2)$.
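
The quadratic bound on the initial size of s_list can be made explicit. The derivation below is a short worked step that uses only the definition of the fanout and the fact that, in a tree with $|S|$ nodes, every node except the root is the child of exactly one node.

```latex
\sum_{q \in S} f(q) = |S| - 1
\quad\Longrightarrow\quad
\sum_{q \in S} \frac{f(q)\,\bigl(f(q) - 1\bigr)}{2}
  = \frac{1}{2}\Bigl(\sum_{q \in S} f(q)^2 - \sum_{q \in S} f(q)\Bigr)
  \le \frac{(|S| - 1)^2 - (|S| - 1)}{2}
  = \frac{(|S| - 1)(|S| - 2)}{2}
  = O(|S|^2)
```

The inequality uses $\sum_q f(q)^2 \le \bigl(\sum_q f(q)\bigr)^2$, which holds because all fanouts are non-negative. The bound is reached by a flat tree in which the root has all other $|S| - 1$ queries as children.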



5 Experiments 

Due to space constraints, we only present a short set of experiments that evaluate
sibling rewritings against the rewriting techniques of [9] and [15]. These experiments
were carried out on a 500MHz Pentium III PC with 256MB of main memory and refer to
an instance of the TPC-R [16] dataset using scaling factor 0.2. We used a commercial
RDBMS for storing the data. All tables have indices on primary and foreign keys.

Table 1. Execution Times (secs)

                       Q1       Q2       Q3      Q12      Q13      Q23     Q123
Small-Doc   Q        2.91   154.46   200.30   242.86   316.29   568.88   849.07
            MaxQ        -        -        -  1034.70  1332.63   473.48  3338.18
            MinQ        -        -        -        -        -   349.11  3136.87
Large-Doc   Q       11.51  1193.03  1439.83  1396.97  1695.97  3626.64  5897.71
            MaxQ        -        -        -  1352.12  1650.23  2366.28  5794.31
            MinQ        -        -        -        -        -  2175.90  5586.54

5.1 Data 

Our documents conform to a simpler version of the DTD in Fig. 1 with only the Customer, 
PartName and SuppName nodes. There are three basic queries corresponding to the nodes 
of this DTD. Query Q1 instantiates the Customer node, Q2 the PartName node and Q3 
the SuppName node. 

We used the field C_CUSTKEY of the CUSTOMER table to control the size of the docu- 
ments we build. For the case denoted as “Small-Doc”, we instantiated the document for 
customers with C_CUSTKEY less than 5000 (i.e., 5000 tuples). The document denoted as 
“Large-Doc” is generated with no restrictions on C_CUSTKEY. The Customer node for 
both documents is “fat”, i.e. all fields from table CUSTOMER are published as attributes of 
Customer. Table 1 summarizes the execution times of all possible parent-child/sibling
merges as well as the queries corresponding to each node in the DTD. Qij denotes the
merged result of queries Qi and Qj. For example, Q12 is the result of a parent/child merge
of Q1 and Q2, while Q23 stands for the sibling merge of Q2 and Q3. The second and third
rows of the table are the modified queries (MaxQ for complete subexpression elimination
and MinQ for partial subexpression elimination). Note that common subexpression
elimination is not defined for all queries (e.g., Q1).

5.2 Results 

Looking at the execution times for the Small-Doc case, a first observation is that ex- 
tensive common subexpression elimination in some cases results in substantially worse 
performance. The reason for this effect is twofold. The first is that common subexpres- 
sion elimination might generate disjunctive predicates that are hard to optimize (see 
query MaxQ in Section 3.3). The second reason is that often common expressions are 
helpful for preserving selections. For instance, both Q12 and Q13 make use of the selec- 
tion on C_CUSTKEY through the repeated join with the CUSTOMER table. Partial common 
expression elimination (MinQ23 and MinQ123 in this example) is better than complete
subexpression elimination but (in the case of Q123) no better than no common subexpression
elimination. In comparison with the complete common subexpression elimination
rewriting, the partial one maintains the join of table PART (resp. SUPPLIER) with table
LINEITEM in the rewriting for Q2 (resp. Q3) in Q23 (same for Q123). This is necessary
to avoid generating a disjunctive predicate (see MinQ in Section 3.3).

In the Large-Doc case, common expression elimination pays off in all cases compared 
to executing Q1, Q2 and Q3 independently. This is because the common expression in
each merged case is expensive and should be evaluated a minimal number of times. 
Furthermore, partial elimination of the common subexpression benefits both queries 
Q23 and Q123 as in the previous case. 

Consider the total time for producing the pieces of the document in the
relational engine for all meaningful combinations of the aforementioned queries: [Q1,
Q2 and Q3], [Q12 (parent/child merge) and Q3], [Q2 and Q13 (parent/child merge)], [Q1
and Q23 (sibling merge)], [Q123 (single query)]. Plan Q1+Q23 is marginally faster than
plan Q1+Q2+Q3 in the Small-Doc case, while it is about 18% faster in the Large-Doc
case.
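
The plan comparison can be reproduced from Table 1 with a few lines of arithmetic. The sketch below assumes that each query in a plan runs in its fastest available variant (Q, MaxQ or MinQ) and that a plan's total is simply the sum of its queries' times; the function names are illustrative.

```python
# Execution times in seconds, copied from Table 1; a missing key means that
# the rewriting is not defined for that query.
TIMES = {
    "Small-Doc": {
        "Q": {"Q1": 2.91, "Q2": 154.46, "Q3": 200.30, "Q12": 242.86,
              "Q13": 316.29, "Q23": 568.88, "Q123": 849.07},
        "MaxQ": {"Q12": 1034.70, "Q13": 1332.63, "Q23": 473.48, "Q123": 3338.18},
        "MinQ": {"Q23": 349.11, "Q123": 3136.87},
    },
    "Large-Doc": {
        "Q": {"Q1": 11.51, "Q2": 1193.03, "Q3": 1439.83, "Q12": 1396.97,
              "Q13": 1695.97, "Q23": 3626.64, "Q123": 5897.71},
        "MaxQ": {"Q12": 1352.12, "Q13": 1650.23, "Q23": 2366.28, "Q123": 5794.31},
        "MinQ": {"Q23": 2175.90, "Q123": 5586.54},
    },
}

# The five meaningful ways of covering Q1, Q2 and Q3.
PLANS = [("Q1", "Q2", "Q3"), ("Q12", "Q3"), ("Q2", "Q13"), ("Q1", "Q23"), ("Q123",)]


def best_time(doc, query):
    """Fastest measured variant of a query for the given document size."""
    return min(variant[query] for variant in TIMES[doc].values() if query in variant)


def plan_totals(doc):
    """Total time of every plan, keyed by a label such as 'Q1+Q23'."""
    return {"+".join(plan): round(sum(best_time(doc, q) for q in plan), 2)
            for plan in PLANS}


if __name__ == "__main__":
    for doc in TIMES:
        totals = plan_totals(doc)
        print(doc, "best plan:", min(totals, key=totals.get), totals)
```

Under these assumptions, Q1+Q23 totals 352.02 s against 357.67 s for Q1+Q2+Q3 in the Small-Doc case, and 2187.41 s against 2644.37 s in the Large-Doc case; in both cases it is the cheapest of the five plans.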



6 Related Work 



Since we are optimizing common sub-expressions among multiple queries, our work
shares similarities with multi-query optimization (see, e.g., [13]). It is well known
that multi-query optimization is exponential [13]. Our work benefits from application-
dependent information (building XML trees) to optimize sibling queries instead of at- 
tempting to optimize an arbitrary subset of queries. This reduces the complexity of the 
optimization. 

In [7], the authors extend relational query engines with a new operator that processes 
sets of tuples. They define new rewriting rules that involve that operator and show how 
to integrate that operator in a relational optimizer. This work motivates the necessity to 
extend relational optimizers. In our work, we do not introduce a new operator, rather, 
we explore new rewriting rules. 

In [9], the authors focus on merging parent and children queries. The rewritings we 
propose are more general than the ones in [9] and explore an additional dimension that 
has been proven to result in better efficiency. In [15], the authors provide an extension to 
SQL to express XML views of relations and carry out an experimental study of publishing
relational data in XML. This work has not adopted an optimization approach to this 
problem. 

Finally, in [4], the authors present ROLEX, a system that extends the capabilities 
of relational engines to deliver efficiently navigable XML views of relational data via a 
virtual DOM interface. DOM operations are translated into an execution plan in order 
to explore lazy materialization. The query optimizer uses a characterization of the navi- 
gation behavior of an application to minimize the expected cost of that navigation. This 
work could benefit from our new optimizations if they are integrated into a relational 
system.

7 Conclusion

We discussed the problem of efficiently building XML documents from relations and 
showed that exploring common computation between sibling queries is a fundamental 
algebraic rewriting when optimizing SQL queries used to build XML documents. In 
particular, we showed that in the case where an element has both unique and repeated
children, sibling merging combined with partial common sub-expression elimination
enables computation sharing without replicating data. This strategy can be used both
inside and outside a relational engine.

Abstract. Processing XML documents in multi-user database management environments
requires a suitable storage model of XML data, support of typical XML
document processing (XDP) interfaces, and concurrency control (CC) mechanisms
tailored to the XML data model. In this paper, we sketch the architecture
and interfaces of our prototype native XML database management system which
can be connected to any existing relational DBMS and provides for declarative and
navigational data access of concurrent transactions. We describe the fine-grained
CC mechanisms implemented in our system and give a first impression of the
benefits thus achieved for concurrent transaction processing in native XML database
management systems.



1 Introduction 

Run an experiment on available DBMSs with collaboratively used XML documents 
[16] and you will experience a "performance catastrophe" meaning that all transactional 
operations are processed in strict serial order. Storing XML documents into relational 
DBMSs forces the developers to use simple CLOBs or to choose among an innumerable 
number of algorithms mapping the semi-structured documents to tables and columns (the 
so-called shredding). In any case, there are no specific provisions to process concurrent 
transactions guaranteeing the ACID properties and using typical XDP interfaces like 
SAX [2], DOM [16], and XQuery [16] simultaneously. Especially isolation in relational 
DBMS does not take the properties of the semi-structured XML data model into account
and causes disastrous locking behavior by blocking entire CLOBs or tables. 

Native XML database systems often use mature storage engines tailored to relational 
structures [13]. Because their XML document mapping is usually based on fixed number- 
ing schemes used to identify XML elements, they primarily support efficient document 
retrieval and query evaluation. Frequent concurrent and transaction-safe modifications
would lead to renumbering of large document parts, which could cause unacceptable
reorganization overhead and degrade XML processing in performance-critical workload
situations. As a rare example of an update-oriented system, Natix [5] is designed to support
concurrent transaction processing, but adopts alternative solutions for data
storage and transaction isolation as compared to our proposal.

Our approach aims at the adequate support of all known types of XDP interfaces
(event-based like SAX, navigational like DOM, and declarative like XQuery) and provides
the well-known ACID properties [7] for their concurrent execution. We have implemented
the XML Transaction Coordinator (XTC) [9], an (O)RDBMS-connectable
DBMS for XML documents, called XDBMS for short, as a testbed for empirical transaction
processing on XML documents. Here, we present its advantages for concurrent
transaction processing in a native XDBMS achieved by a storage model and CC mechanisms
tailored to the XML data model. This specific CC improves not only collaborative
XDP but also SQL applications when "ROX: Relational Over XML" [8] becomes true.

An overview of the XTC architecture and its XDP interfaces is sketched in Section
2. Concurrent data access is supported by locks tailored to the taDOM tree [10], a data
model which extends the DOM tree, as outlined in Sections 3 and 4, thereby providing
tunable, fine-grained lock granularity and lock escalation as well as navigational trans- 
action path locking inside an XML document. In Section 5, we give a first impression of 
concurrent transaction processing gains, before we wrap up with conclusions and some 
aspects of future work in Section 6. 



2 System Architecture and XDP Interfaces 



Our XTC database engine (XTCserver) adheres to the widely used five-layer DBMS 
architecture [11]. In Figure 1, we concentrate on the representation and mapping of 
XML documents. Processing of relational data is not a focus of this paper. 

The file-services layer operates on the bit pattern stored on external, non-volatile 
storage devices. In collaboration with the OS file system, the I/O managers store the
physical data into extensible container files; their uniform block length is configurable
to the characteristics of the XML documents to be stored. A buffer manager per container 
file handles fixing and unfixing of pages in main memory and provides a replacement 
algorithm for them which can be optimized to the anticipated reference locality inher- 
ent in the respective XDP applications. Using pages as basic storage units, the record, 
index, and catalog managers form the access services. The record manager maintains 
in a set of pages the tree-connected nodes of XML documents as physically adjacent 
records. Each record is addressed by a unique life-time ID managed within a B-tree by 
the index manager [9]. This is essential to allow for fine-grained concurrency control
which requires lock acquisition on uniquely identifiable nodes (see Section 4). The catalog
manager provides for the database metadata. The node manager implementing the navi- 
gational access layer transforms the records from their internal physical into an external 
representation, thereby managing the lock acquisition to isolate the concurrent transac- 
tions. The XML-services layer contains the XML manager responsible for declarative 
document access, e. g., evaluation of XPath queries or XSLT transformations [16]. 

At the top of our architecture, the agents of the interface layer make the functionality 
of the XML and node services available to common internet browsers, ftp clients, and 
the XTCdriver thereby achieving declarative / set-oriented as well as navigational / 
node-oriented interfaces. The XTCdriver linked to client-side applications provides for 
methods to execute XPath-like queries and to manipulate documents via the SAX or 
DOM API. Each API accesses the stored documents within a transaction to be started 
by the XTCdriver. Transactions can be processed in the well-known isolation levels 
uncommitted, committed, repeatable, and serializable [1].

Fig. 1. XTC architecture overview



3 Storage Model 

Efficient and effective synchronization of concurrent XDP is greatly facilitated if we 
use a specialized internal representation which enables fine-granular locking. For this 
reason, we will introduce two new node types: attributeRoot and string. This represen- 
tational enhancement does not influence the user operations and their semantics on the 
XML document, but is solely exploited by the lock manager to achieve certain kinds of 
optimizations when an XML document is modified in a cooperative environment. As a
running example, we, therefore, refer to an XML document which is slightly enhanced 
for our purpose to a so-called taDOM tree [10], as shown in Figure 2. 

AttributeRoot separates the various attribute nodes from their element node. Instead 
of locking all attribute nodes separately when the DOM method getAttributes() is invoked,
the lock manager obtains the same effect by a single lock on attributeRoot. Hence, such 
a lock does not affect parallelism, but leads to more effective lock handling and, thus, 
potentially to better performance. A string node, in contrast, is attached to the respective 
text or attribute node and exclusively contains the value of this node. Because reference 
to that value requires an explicit invocation of getValue( ) with a preceding lock request, 
a simple existence test on a text or attribute node avoids locking such nodes. Hence, a 
transaction only navigating across such nodes will not be blocked, although a concurrent 
transaction may have modified them and may still hold exclusive locks on them.

Fig. 2. A sample taDOM tree



It is essential for the locking performance to provide a suitable storage structure for
taDOM trees which supports a flexible storage layout that allows a distinguishable
(separate) node representation of all node types to achieve fine-grained locking. There- 
fore, we have implemented various container types which enable effective storage of 
very large and very small attribute and element nodes as well as combinations thereof 
[9]. Furthermore, fast access to and identification of all nodes of an XML document is 
mandatory to enable efficient processing of direct-access methods, navigational meth- 
ods, and lock management. For this reason, our record manager assigns to each node a 
unique node ID (rapidly accessible via a B-tree) and stores the node as a record in a data 
page. The tree order of the XML nodes is preserved by the physical order of the records 
within logically consecutive pages (chained by next/previous page pointers) together 
with a so-called level indicator per record.
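
To illustrate why document order plus a level indicator is enough to preserve the tree, the following sketch rebuilds the parent of every node from a flat sequence of records. It assumes that the level indicator is the depth of the node (root at level 0) and that records arrive in document order; the (node_id, level) record layout and the function name are hypothetical and do not reflect XTC's actual record format.

```python
def parents_from_levels(records):
    """Rebuild parent links from (node_id, level) pairs given in document order.

    Returns a dict mapping each node id to its parent id (None for the root).
    """
    parent = {}
    path = []  # path[i] is the most recently seen node at level i
    for node_id, level in records:
        # Nodes at this level or deeper cannot be ancestors of the new node.
        del path[level:]
        parent[node_id] = path[-1] if path else None
        path.append(node_id)
    return parent


if __name__ == "__main__":
    records = [("bib", 0), ("book1", 1), ("title", 2), ("author", 2), ("book2", 1)]
    print(parents_from_levels(records))
```

A single forward scan suffices because, in document order, the parent of a node at level k is always the last node seen at level k-1.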

4 Concurrency Control 

So far, we have explained the newly introduced node types and how fast and selective 
access to all nodes of an XML document can be guaranteed. In a concurrent environment, 
the various types of XML operations have to be synchronized using appropriate protocols 
entirely transparent to the different XDP interfaces supported. Hence, a lock manager is 
responsible for the acquisition and maintenance of locks, processing of the quite complex 
locking protocols and their adherence to correctness criteria, as well as optimization 
issues such as adequate lock granularity and lock escalation. 

Because the DOM API not only supports navigation starting from the document root,
but also allows jumps "out of the blue" to an arbitrary node within the document, locks
must be acquired automatically, that is, by the lock manager, in either case for the path
of ancestor nodes. The currently accessed node is called context node in the following.

This up-to-the-root locking procedure is performed as follows: If such an ancestor 
path is traversed the first time and if the IDs of the ancestors are not present in the 
so-called parent index (on-demand indexing of structural relationships [9]) for this path, 
the record manager is invoked to access stored records thereby searching all ancestor 
records. The IDs of these records are saved in the parent index. Hence, future traversals 
of this ancestor path can be processed via the parent index only. Navigational locking 
of children or siblings is optimized by such structural indexes in a similar way. 
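
The on-demand parent index can be pictured as a cache in front of the record manager. The sketch below is an illustration under stated assumptions rather than XTC code: `fetch_parent` is a hypothetical stand-in for the record-manager lookup, the index is a plain in-memory dictionary, and lock acquisition on the returned path is left out.

```python
class ParentIndex:
    """Caches child-to-parent relationships the first time they are looked up."""

    def __init__(self, fetch_parent):
        self.fetch_parent = fetch_parent  # stand-in for the record manager
        self.index = {}                   # node id -> parent id, filled on demand
        self.record_accesses = 0          # how often stored records were consulted

    def ancestors(self, node_id):
        """Return the ancestor path of node_id, nearest ancestor first."""
        path = []
        current = node_id
        while True:
            if current not in self.index:
                # First traversal of this edge: consult the stored records.
                self.record_accesses += 1
                self.index[current] = self.fetch_parent(current)
            current = self.index[current]
            if current is None:  # walked past the document root
                return path
            path.append(current)


if __name__ == "__main__":
    tree = {"root": None, "a": "root", "b": "a", "c": "a"}
    idx = ParentIndex(tree.get)
    print(idx.ancestors("b"), idx.record_accesses)
    print(idx.ancestors("c"), idx.record_accesses)
```

In the usage example, the second call only needs one record access for node c itself, because the path from a to the root is already in the index.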

The lock modes depend on the type of access to be performed, for which we have 
tailored the node lock compatibilities and defined the rules for lock conversion as outlined 
in Section 4.1 and Section 4.2. To achieve optimal parallelism, we discuss means to tune 
lock granularities and lock escalation in Section 4.3. When an XML document has to 
be traversed by navigational methods, then the actual navigation paths also need strict 
synchronization. This means, a sequence of method calls must always obtain the same 
sequence of result nodes. To support this demand, we present so-called navigation locks 
in Section 4.4. Furthermore, query access methods also need strict synchronization to 
accomplish the well-known repeatable read property and, in addition, the prevention of 
phantoms in rare cases. Our specific solution is outlined in Section 4.5. 



4.1 Node Locks 

While traversing or modifying an XML document, a transaction has to acquire a lock 
in an adequate mode for each node before accessing it. Because the nodes in an XML 
document are organized by a tree structure, the principles of multi-granularity locking 
schemes can be applied. The method calls of the different XDP interfaces used by an 
application are interpreted by the lock manager to select the appropriate lock modes 
for the entire ancestor path. Such tree locking is similar to multi-granularity locking 
in relational environments (SQL) where intention locks communicate a transaction’s 
processing needs to concurrent transactions. In particular, they prevent a subtree s from 
being locked in a mode incompatible to locks already granted to s or subtrees of s. 
However, there is a major difference, because the nodes in an ancestor path are part of 
the document and carry user data, whereas, in a relational DB, user data is exclusively 
stored in the leaves (records) of the tree (DAG) whose higher-level nodes are formed 
by organizational concepts (e. g., table, segment, DB). For example, it makes perfect 
sense to lock an intermediate XML node n for reads, while in the subtree of n another 
transaction may perform updates. For this and other reasons, we differentiate the read 
and write operations thereby replacing the well-known (IR, R) and (IX, X) lock modes 
with (NR, LR, SR) and (IX, CX, X) modes, respectively. As in the multi-granularity 
scheme, the U mode plays a special role because it permits lock conversion. Figure 3a 
contains the compatibility matrix for our lock modes whose effects are described now: 

• An NR lock mode (node read) is requested for reading the context node. To isolate 
such a read access, an NR lock has to be acquired for each node in the ancestor path. 
Note, the NR mode takes over the role of IR together with a specialized R, because 
it only locks the specified node, but not any descendant nodes. 

• An IX lock mode (intention exclusive) indicates the intent to perform write operations 
somewhere in the subtree (similar to the multi-granularity locking approach), 
but not on a direct-child node of the node being locked (see CX lock). 

• An LR lock mode (level read) locks the context node together with its direct-child 
nodes for shared access. For example, the method getChildNodes() only requires 
an LR lock on the context node and not individual NR locks for all child nodes. 
Similarly, an LR lock, requested for an attributeRoot node, locks all its attributes 
implicitly (to save lock requests for the getAttributes( ) method). 

• An SR lock mode (subtree read) is requested for the context node c as the root 
of subtree s to perform read operations on all nodes belonging to s. Hence, the 
entire subtree is granted for shared access. An SR lock on c is typically used if s is 
completely reconstructed to be printed out as an XML fragment. 

• A CX lock mode (child exclusive) on context node c indicates the existence of 
an X lock on some direct-child node and prohibits inconsistent locking states by 
preventing LR and SR lock modes. In contrast, it does not prohibit other CX locks on 
c, because separate direct-child nodes of c may be exclusively locked by concurrent 
transactions. 

• A U lock mode (update option) supports a read operation on context node c with the 
option to convert the mode for subsequent write access. It can be either converted 
back to a read lock if the inspection of c shows that no update action is needed or 
to an X lock after all existing read locks on c are released. Note the asymmetry in 
the compatibility definition between U and (NR, IX, LR, SR, CX), which prevents 
further read locks from being granted on c, thereby enhancing protocol fairness, 
that is, avoiding transaction starvation. 

• To modify the context node c (updating its contents or deleting c and its entire 
subtree), an X lock mode (exclusive) is needed for c. It implies a CX lock for its 
parent node and an IX lock for all other ancestors up to the document root. 
Note again, this differing behavior of CX and IX locks is needed to enable compatibility 
of IX and LR locks and to enforce incompatibility of CX and LR locks. 

Fig. 3. Node locking for the taDOM tree 
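
The ancestor-path rules stated in the list above can be sketched as follows. This is only an illustration, not the lock manager of the system: the function name and the node names are hypothetical, and only the NR and X requests, whose ancestor locks are spelled out in the text, are covered.

```python
def locks_for_request(path, mode):
    """Return (node, lock mode) pairs from the document root down to the context node.

    path: node names from the document root to the context node (hypothetical names).
    mode: "NR" (read the context node) or "X" (modify the context node).
    """
    if not path:
        raise ValueError("path must contain at least the context node")
    *ancestors, context = path
    if mode == "NR":
        # Node read: NR on every ancestor and on the context node itself.
        return [(node, "NR") for node in ancestors] + [(context, "NR")]
    if mode == "X":
        # Exclusive: IX on all ancestors except the parent, CX on the parent.
        locks = [(node, "IX") for node in ancestors[:-1]]
        if ancestors:
            locks.append((ancestors[-1], "CX"))
        locks.append((context, "X"))
        return locks
    raise ValueError("only NR and X are covered by this sketch")
```

For a hypothetical path bib, book, editor, text, an X request yields IX on bib and book, CX on editor, and X on the text node.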

Figure 3b represents a cutout of the taDOM tree depicted in Figure 2 and illustrates 
the result of the following example: Transaction T1 starts modifying the value Darcy 
and, therefore, acquires an X lock for the corresponding string node. The lock manager 
complements this action by accessing all ancestors and by acquiring a CX lock for the 
parent and IX locks for all further ancestors. Simultaneously, transaction T2 wants to 
delete the entire <editor> node including the string Gerbag for which T2 must acquire 
an X lock. This lock request, however, cannot be immediately granted because of the 
existing IX lock of T1. Hence, T2, placing its request in the lock request queue (LRQ: 
X2), must synchronously wait for the release of the IX lock of T1 on the <editor> 
node. Meanwhile, transaction T3 is generating a list of all book titles and has, therefore, 
requested an LR lock for the <bib> node to obtain read access to all direct-child nodes 
thereby using the level-read optimization. To access the title strings for each <book> 
node, the paths downwards to them are locked by NR locks. Note, the LR lock of T3 on 
<bib> implicitly locks the <book> nodes in shared mode and does not prohibit updates 
somewhere deeper in the tree. If X2 is eventually granted for the <editor> node, T2 
gets its CX lock on the <book> node and its IX locks granted up to the root. 
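
The grant-or-wait decision in this example can be sketched with a partial compatibility table. The table below is an assumption-laden excerpt: it contains only the mode pairs whose compatibility is stated explicitly in the running text, whereas the complete matrix is given in Figure 3a and is not reproduced here. The function names are hypothetical.

```python
# Pairs of lock modes whose compatibility is stated in the text.
# Keys are unordered pairs; a one-element frozenset stands for (M, M).
KNOWN = {
    frozenset(("CX", "CX")): True,   # separate children may be X-locked concurrently
    frozenset(("CX", "LR")): False,  # CX prevents level reads
    frozenset(("CX", "SR")): False,  # CX prevents subtree reads
    frozenset(("IX", "LR")): True,   # updates deeper in the tree do not disturb LR
    frozenset(("IX", "X")): False,   # T2's X request waits for T1's IX lock
    frozenset(("X", "X")): False,
    frozenset(("CX", "NR")): True,
}

def compatible(held, requested):
    """True or False if the text states the pair, None if only Figure 3a would tell."""
    return KNOWN.get(frozenset((held, requested)))

def can_grant(granted_modes, requested):
    """Check a request against all locks other transactions hold on the node."""
    answers = [compatible(held, requested) for held in granted_modes]
    if False in answers:
        return False   # the request has to wait in the lock request queue
    if None in answers:
        return None    # undecidable with this partial table
    return True
```

With T1 holding IX on the <editor> node, `can_grant(["IX"], "X")` is False, which corresponds to T2 waiting in the lock request queue.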

4.2 Node Lock Conversion 

The compatibility matrix shown in Figure 3a describes the compatibility of locks acquired 
on the same node by separate transactions. If a transaction T already holds a lock 
and requests a lock in a more restrictive or incomparable mode on the same node, we 
would have to keep two locks for T on this node. In general, k locks per transaction and 
node are conceivable. This approach would require longer lists of granted locks per 
node and a more complex run-time inspection algorithm checking for lock compatibility. 
Therefore, we replace all locks of a transaction per node with a single lock in a mode 
giving sufficient isolation. The corresponding rules are specified by the lock conversion 
matrix in Figure 4, which determines the resulting lock for context node c, if a transaction 
already holds a lock (matrix header row) and requests a further lock (matrix header 
column) on c. A lock l1 specified with an additional subscripted lock l2 (e.g., CX_NR) 
means that l1 has to be acquired on c and l2 has to be acquired on each direct-child node 
of c. The following paragraph gives an example of this procedure. 

Assume a user starts a transaction requesting all child nodes of c, which results in 
acquiring an LR lock on c. LR mode locks c and all direct-child nodes in shared mode. 
After that, the user wants to delete one of the previously determined child nodes. 
Therefore, the transaction acquires an X lock on the corresponding child node and, 
applying the locking protocol, this requires the acquisition of a CX lock on c, on which 
the transaction already holds the LR lock. Using rule CX_NR specified in Figure 4, 
the transaction has to convert the existing LR lock on c to a CX lock and to acquire an 
NR lock on each direct-child node of c (except the child node which is already locked 
for deletion by an X lock). 
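
The single conversion described in this example (an LR lock held on c, followed by the deletion of one child) can be sketched as follows. Only this one rule is taken from the text; the rest of the conversion matrix in Figure 4 is not reproduced, and the function and node names are hypothetical.

```python
def convert_lr_for_child_delete(context, children, deleted_child):
    """Locks one transaction holds after converting LR on `context` for a child delete.

    The LR lock on the context node becomes CX, the deleted child is X-locked,
    and every other direct child gets an explicit NR lock, so that the shared
    access previously implied by LR is preserved.
    """
    if deleted_child not in children:
        raise ValueError("deleted_child must be a direct child of the context node")
    locks = {context: "CX"}
    for child in children:
        locks[child] = "X" if child == deleted_child else "NR"
    return locks
```

For a context node c with children a, b, and d, deleting b leaves the transaction with CX on c, X on b, and NR on a and d.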



4.3 Tunable Node Lock Granularity and Lock Escalation 

Entire subtrees in the taDOM tree can be locked by both SR locks enabling shared 
access or X locks granting exclusive access. In either case, we want to improve flexi- 
bility, efficiency, and potential parallelism of our locking protocols by enabling tunable 
lock granularity and lock escalation. The combined use of them increases operational 
throughput because, due to lock escalation, the number of lock requests can be reduced 
enormously and, due to fine-tuned lock granularity, higher concurrency may be gained. 

To tune the lock granularity of nodes for each transaction separately, the parameter 
lock depth (ld ≥ 0) is introduced. Parameter ld describes the lock granularity by means 
of the number of node levels (from document root) on which locks are to be held. If a 
lock is requested for context node c whose path length to the document root element is 
greater than ld, only an SR lock for the ancestor node belonging to the lock-depth level 
is requested. In this way, nodes at deeper levels than indicated by ld are locked in shared 
mode using an SR lock on the node at level ld, that is, entire subtrees are locked starting 
at the specified lock-depth level of the requesting transaction. As a corollary, ld = 0 
provides document locks, e.g., locks on the <bib> node in Figure 5. This allows the 
traversal of a large document fragment in read mode without acquiring any additional 
node locks. In the same way, several X locks can be replaced with a single X lock at a 
chosen document level l ≤ ld. 
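
A minimal sketch of how the lock-depth parameter could map a request onto a coarser lock is given below. It assumes that the document root is at level 0 and that a request deeper than ld is redirected to the ancestor at level ld (SR for read modes, X for writes). The function name and the node names are hypothetical.

```python
def effective_lock(path, mode, ld):
    """Map a lock request to the node and mode actually locked under lock depth ld.

    path: node names from the document root (level 0) to the context node.
    mode: requested mode on the context node: "NR", "LR", "SR", or "X".
    """
    if ld < 0:
        raise ValueError("lock depth must not be negative")
    if mode not in ("NR", "LR", "SR", "X"):
        raise ValueError("mode not covered by this sketch")
    level = len(path) - 1          # path length to the document root
    if level > ld:
        # Too deep: lock the whole subtree rooted at the lock-depth level.
        return path[ld], ("X" if mode == "X" else "SR")
    return path[-1], mode          # within the lock depth: lock as requested
```

With ld = 0, every request deeper than the root is turned into a document lock on the root node.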

Figure 5 shows the taDOM-tree cutout of Figure 3b illustrating the effect of the lock-depth 
parameter. With ld = 2, the NR locks of transaction T3 on the <title> and <editor> 
nodes are replaced with SR locks for the <title> nodes. The IX, CX, and X locks of T1 
on the <editor> node and its descendants are replaced by a single X lock on the <editor> 
node. As a prerequisite, it requires CX and IX locks on the ancestor nodes <book> and 
<bib>, respectively. Transaction T2 is again in a wait state, because the requested X lock 
is not compatible with the existing X lock of T1. 

Fig. 5. Coarse-grained node locks with lock depth 2 

In a similar way, lock escalation can be achieved. To tune lock escalation, we introduce 
two parameters, the escalation threshold (et) and the escalation depth (ed). The lock 
manager scans the taDOM tree at prespecified intervals. If the manager detects a subtree 
in which the number of locked nodes of a transaction exceeds the percentage threshold 
value defined by et, the locks held are replaced by an adequate lock at the subtree root, 
if possible (i.e., no conflicting locks are encountered). Read and write locks are replaced 
by SR and X locks, respectively. The parameter ed defines the maximal subtree depth 
starting from the leaves of a taDOM tree up to the scanned subtree root. Obviously, lock 
escalation involves a trade-off: it decreases the concurrency of read and write transactions 
but, in turn, reduces the number of held locks and of lock acquisitions, which saves lock 
management overhead. Its empirical evaluation remains a future task. 
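
The escalation check can be illustrated by the following sketch. It rests on several assumptions that are not taken from the paper: locks are simplified to "read" and "write" per transaction and node, an SR lock is taken to conflict with write locks of other transactions, and an X lock is taken to conflict with any lock of another transaction. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    locks: dict = field(default_factory=dict)   # transaction id -> "read" or "write"

def subtree(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)

def height(node):
    """Number of levels between the node and its deepest leaf."""
    return 0 if not node.children else 1 + max(height(c) for c in node.children)

def escalation_lock(root, txn, et, ed):
    """Return "SR" or "X" if txn's locks below root should be escalated, else None."""
    if height(root) > ed:
        return None                      # subtree is deeper than the escalation depth
    nodes = list(subtree(root))
    own = [n.locks[txn] for n in nodes if txn in n.locks]
    if 100.0 * len(own) / len(nodes) <= et:
        return None                      # percentage threshold et not exceeded
    target = "X" if "write" in own else "SR"
    for n in nodes:
        for other, kind in n.locks.items():
            if other != txn and (target == "X" or kind == "write"):
                return None              # conflicting lock of another transaction
    return target
```

A periodic scan would call `escalation_lock` for candidate subtree roots and, on a non-None result, replace the transaction's individual locks by the returned mode at the subtree root.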
Fig. 6. Locking navigational operations in a taDOM tree 



4.4 Navigation Locks 

So far, we have discussed optimization issues for locks where the node to be accessed was 
specified by its unique ID. In addition, the DOM API also provides (about 20) methods 
which enable the traversal of XML documents where access is specified relative to the 
context node. In such cases, synchronizing a navigation path means that a sequence of 
navigational method calls or modification (IUD) operations, starting at a known node 
within the taDOM tree, must always yield the same sequence of result nodes within 
a transaction. Hence, a path of nodes within the document evaluated by a transaction 
must be protected against modifications of concurrent transactions. Assume in Figure 
2, a transaction T navigates through all or a range of <book> nodes and wants to be 
isolated from concurrent inserts of new <book> nodes. Of course, we have already 
introduced some lock modes which enable perfect, but (too) expensive isolation in this 
situation, caused by (too) large lock granules. For example, if we acquire an LR lock on 
the <bib> node, all <book> nodes are implicitly granted in shared mode. An SR lock on 
<bib> would even prohibit updates on the entire document. We, however, want to support 
a solution only using minimal lock granules, that is, node locks of mode NR. Therefore, 
we introduce virtual navigation edges for element and text nodes within the taDOM tree 
(Figure 6b) which are locked in addition to their confining nodes. 

While navigating through an XML document and traversing the navigation edges, 
a transaction has to request a lock for each edge, in addition to the node locks (NR) 
for the nodes visited. Note, these edges are logical objects which are not materialized 
but embodied by their confining nodes. Because each navigation step only performs 
local operations (first/last, next/previous) to a sibling or child of the context node c, the 
R/U/X locks known from relational records or tables are sufficient. Traversal operations 
between nodes need bidirectional isolation: For example, if getNextSibling( ) is invoked 
on node c and delivers node n, then, as a first step, the next-sibling edge of c is locked. 
In addition, we must lock the previous-sibling edge of n to prohibit path modifications 
between n and c through another transaction via node n. To support such traversals 
efficiently, we offer the ER, EU, and EX lock modes corresponding to R/U/X. Their use 
observing the compatibilities shown in Figure 6a can be summarized as follows: 

• An ER lock mode (edge read) is needed for an edge traversal in read mode, e. g., by 
calling the getNextSibling( ) or getFirstChild( ) DOM method for the nextSiblingEdge 
or firstChildEdge, respectively. 

• An EX lock mode (edge exclusive) enables an edge to be modified which may be 
needed when nodes are deleted or inserted. For all edges affected by the modification 
operation, EX locks are acquired before the navigation edges are redirected to their 
new target nodes. 

• The EU lock mode (edge update) eases the starvation problem of write transactions 
(see lock mode U in Section 4.1). 

Figure 7 illustrates navigation locks on virtual navigation edges. To keep Figure 7 
comprehensible, we do not show the node locks, e.g., NR or CX. Transaction T1 starts 
at the <bib> node and reads three times the first-child node (that is, the node sequence 
<bib>, <book>, <title>, <text>) to get the string value (Data o . . . ) of the first 
book title. Then T1 refers to the next-sibling node of the current <book> node and repeats 
twice the first-child method to get the title of the second book. At this point, the requested 
book is located, and T1 finally gets the next sibling of the current <title> node which 
is the <editor> node. Apparently, our protocol allows concurrent transaction T2 to 
append a new book by acquiring EX locks for the next-sibling edge of the last <book> 
node and for the last-child edge of the <bib> node. Of course, T2 has to protect its 
ancestor path in a sufficient mode; its CX lock on <bib> is compatible with the NR 
lock of T1. 
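
The edge locks used in this example can be sketched as follows. The two helper functions only restate which edges the text says are locked for a next-sibling traversal and for appending a child. The conflict test is an assumption: it treats ER and EX like the classical R and X modes (the actual matrix is in Figure 6a) and ignores EU. Node and function names are hypothetical.

```python
def edge_locks_for_next_sibling(c, n):
    """getNextSibling( ) on c delivering n: lock the edge in both directions."""
    return [((c, "nextSibling"), "ER"), ((n, "previousSibling"), "ER")]

def edge_locks_for_append(parent, last_child):
    """Appending a new last child: the affected edges are locked exclusively."""
    return [((last_child, "nextSibling"), "EX"), ((parent, "lastChild"), "EX")]

def conflicts(locks_a, locks_b):
    """Assumed R/X semantics: a conflict needs a common edge and at least one EX."""
    held = dict(locks_a)
    return any(edge in held and "EX" in (mode, held[edge]) for edge, mode in locks_b)
```

A reader moving from the first to the second of three books and a writer appending after the third book touch disjoint edges, so under these assumptions neither has to wait for the other.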

4.5 Prevention of Phantoms 

As outlined so far, our protocols enable fine-grained solutions for repeatable read and 
even serializable when record-oriented operations are used, i.e., direct as well as 
navigational access to sequences of document nodes. Note, "gaps" between nodes can be 
protected by edge locks which prevent a newly inserted document node from appearing as a 
phantom. 

But how do we solve the phantom problem in XML documents for set-oriented 
access? If we are willing to lock larger granules and thereby potentially sacrifice some 
parallelism, we can use the same trick known from multi-granularity locking: we just 
acquire an exclusive lock one level above the working node, that is, on its direct ancestor, 
and prevent the transaction from being confused by phantom inserts. Obviously, this 
straightforward approach also increases blocking and deadlock probability. For example, 
if the getElementsByTagName( ) method of the DOM API is invoked on an arbitrary node 
n, all its sibling nodes and their subtrees are locked, because the parent node of n holds 
the phantom-preventing lock. Hence, this approach may turn out to be too coarse. 

Because we may not guarantee "serializability" in the strict sense when fine-grained 
lock protocols are used for set-oriented access, we currently support the so-called 
consistency level 2.99 [7] in such situations. Our mechanism described in [10] is based 
on the concept of precision locks [14]. Because our empirical experiments outlined in 
Section 5 do not critically rely on effective phantom protection, we will not refine it here. 
While it is path oriented and can, therefore, be also exploited for (simple) declarative 
interfaces, phantom prevention for the full expressiveness of XQuery is subject of our 
future research. 



5 Performance Evaluation 



In our first experiment, we consider the basic cost of lock management described so far. 
For this purpose, we use the xmlgen tool of the XMark XML benchmark project [15] to 
generate a variety of XML documents consisting of 5,000 up to 25,000 individual XML 
nodes. The documents are stored in our native XDBMS [9] and accessed by a client-side 
DOM application requesting every node by a separate RMI call. To reveal lock 
management overhead, each XML document is reconstructed by a consecutive traversal 
in depth-first order under isolation levels committed and repeatable read. Isolation level 
committed certainly provides higher degrees of concurrency with (potentially) lesser 
degrees of consistency of shared documents; when used, the programmer accepts a 
responsibility to achieve full consistency. Depending on the position of the node to be 
locked, committed may cause much more overhead, because each individual node access 
requires short read locks along its ancestor path. In contrast, isolation level repeatable 
read sets long locks until transaction commit and, hence, does not need to repetitively 
lock ancestor nodes. In fact, they are already locked due to the depth-first traversal. 
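
The difference between the two isolation levels can be made concrete with a simplified counting model, which is an assumption of this sketch rather than a measurement: under committed, accessing a node at depth d requests d + 1 short locks (the node plus its ancestors), whereas under repeatable read only the node itself is newly locked, because its ancestors are still held. The tree encoding and the function name are hypothetical.

```python
def count_lock_requests(tree, depth=0):
    """Count lock requests for a depth-first traversal of `tree`.

    tree: a (name, children) pair, where children is a list of such pairs.
    Returns (committed, repeatable_read) request counts under the simplified model.
    """
    name, children = tree
    committed = depth + 1    # short locks on the node and on each of its ancestors
    repeatable = 1           # ancestors are still locked; only the node is new
    for child in children:
        c, r = count_lock_requests(child, depth + 1)
        committed += c
        repeatable += r
    return committed, repeatable
```

In this model the committed count grows with the sum of all node depths, while the repeatable read count equals the number of nodes, which is in line with the qualitative expectation stated above.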

These expectations are confirmed by the results of this first experiment as depicted in 
Figure 8. The potential performance gain of the reduced isolation level committed is 
contrasted by the dramatically increasing lock management overhead due to repeated 
locking and releasing of locks along the entire ancestor path. Hence, substantial lock 
processing time is consumed in committed mode (>300% of the reconstruction time 
under isolation level none, i.e., without locking overhead), whereas the overhead for 
repeatable read is acceptable (~25%). To guarantee highly consistent documents, 
repeatable read should be used for concurrent transactions. However, its penalty of longer 
lock durations has to be compensated by effective and fine-granular lock modes which 
coincides with the objectives of our proposal. 

Fig. 8. Document reconstruction time 

The second experiment illustrates the benefits for transaction throughput depending 
on the chosen isolation level and lock-depth value. For this purpose, we extend the 
sample document of Figure 2 to a library database by grouping the books into specific 
topics and adding a persons' directory. The DataGuide describing the resulting XML 
document is depicted in Figure 9. We created the library document with 500 persons 
and 25,000 books grouped into 50 specific topics. The resulting document (requiring 
approximately 6.4 MB) consists of 483,317 XML nodes and is stored in our XDBMS 
[9]. 

We apply different transaction types simulating typical read/write access to XML 
documents. Transaction TB is searching for a book with a randomly selected title. This 
simulates a query of a library visitor. The activities of the library employees are 
represented by transactions TP, TL, and TR. Transaction TP is searching for a randomly 
chosen person by his/her last name. Transactions TL and TR are simulating the lending 
of books. Transaction TL randomly locates a person and a book to be lent; then it adds a 
new child node containing the person's id to the <history> element within the located 
<book> subtree. Transaction TR "returns" the book by setting the return attribute of 
the corresponding <lend> element to the current system date. 

Ten clients with read transactions of type TB and one client with a read transaction 
of type TP are continuously executing for ten minutes on the library document to provide 
a base load on the XDBMS. Two clients are executing write transactions of type TL and 
TR, making a total of 13 concurrent transactions in the system. A deadlock detector is 
scanning the wait-for graph of the transactions every five seconds. The XDBMS is running 
on an IBM eServer xSeries 235 with two Intel Xeon-A 2.4GHz processors. The server 
machine executes Microsoft Windows Server 2003 Enterprise Edition, whereas the 
clients are running on an IBM R32 ThinkPad, connected with a 100Mbps network 
to the server. 

To explore transaction throughput in two different processing modes, we run this 
experiment in batch mode (no human interaction while a transaction is running) and 
with human interaction. The latter case is simulated by a delay of 5 seconds by which 
the duration of long locks is extended in each transaction, before they are released 
at transaction commit. At least in relational environments, everybody would expect a 
decrease of transaction throughput with increasing isolation levels: none, uncommitted, 
committed, repeatable read, serializable, where in our experiments both isolation levels 
repeatable read and serializable produce identical results. On the other dimension, with 
increasing lock depth — if facilitated by the element position processed in the tree — 
growing transaction throughput is anticipated because of shrinking lock granules. 

Without surprise, maximum transaction throughput is reached for isolation level 
uncommitted in all experiments, because read locks are abandoned. Write locks, in turn, 
seriously interfere with concurrent transactions only at lock depths 0 and 1 (see Figures 
10a and 11a), whereas they hardly affect them at lock depths 2 to 7. 

5.1 Batched Transaction Processing 

As the most striking observation, conducting our experiment in batch-processing mode 
revealed in all cases a higher throughput at isolation level repeatable read than at 
committed, because the long read locks avoid the subsequent traversals of ancestor paths 
for lock acquisitions in most cases. The curves of committed write transactions (depending 
on the lock depth) are similar at all isolation levels (see Figure 10a). Most of the conflicts 
(waiting cycles or deadlocks) are occurring at lock depths 0 and 1, because transactions 
TB, TL, and TR are locking the <bib> and the <topics> nodes, respectively. Hence, all 
isolation levels nearly yield the same throughput of write transactions. 

Increasing the lock depth value from 1 to 2, many more write transactions commit, 
because most of them are executed concurrently (only those accessing the same topic 
have to be serialized). The number of successful write transactions is slightly increasing 
from lock depth 2 to 7, because the transactions are keeping long locks which avoid a 
repeated traversal of complete ancestor paths in most cases when additional locks are 
requested. But surprisingly, a higher degree of isolation also enables higher throughput of 
write transactions, which can be explained by the following observation: Repeatable read 
yields shorter transaction processing times than committed because the read operations 
of the write transactions TL and TR do not acquire and immediately release (a set of) 
short read locks for each node access. 

The number of committed transactions (Figure 10b) is primarily depending on the 
commits of the readers TB and TP, because, compared to the writers TL and TR, they 
contain fewer operations and are executed by more client threads in parallel. Because of 
the short read locks, the throughput for committed behaves even worse than for repeatable 
read. The peak at lock depth 1 is caused by transaction TP which is executed without 
interference while TB, TL, and TR are frequently blocking each other. This peak number 
of commits (mainly due to TP) decreases from lock depth 1 to 4, because more and 
more transactions of type TB, TL, and TR successfully finish thereby increasing lock 
and transaction management overhead. At lock depth 4, locking conflicts of transactions 
TL and TR do not affect TB anymore. Hence, from lock depths 4 to 7, the XDBMS 
seems to be in a kind of steady state and achieves stable transaction throughput. 

5.2 "Interactive" Transaction Processing 

Transactions interrupted by human interactions ("the human is in the loop") or performing 
complex operations may exhibit drastically increased lock duration times. While the 
average transaction response time and lock duration was far less than a second in batch- 
processing mode, now the average lock duration was "artificially" increased probably 
by more than a factor of 10. As a consequence, the finer granularity of locks and the 
duration of short read locks gained in importance on transaction throughput while the 
relative effect of lock management overhead was essentially scaled down. Longer lock 
durations and, in turn, blocking times reduced the number of successful commits (write 
and overall transactions) to about 50% and 10% as shown in Figure 11a and b and caused 
a relative performance behavior as anticipated in relational environments. 

In general, transaction throughput can be increased by decreasing the level of iso- 
lation (from repeatable read down to uncommitted) or increasing the lock depth (if 
possible). As observed at lock depths 4 to 7 in Section 5.1, all transactions can be ex- 
ecuted in parallel and our XDBMS approaches stable transaction throughput in this 
experiment. 

For future benchmarks, we expect the gap between uncommitted and committed to 
grow larger for "deeper" XML documents (longer paths from the root to the leaves). 
Similarly, the gap between committed and repeatable read widens with an increasing 
percentage of write transactions (causing more waiting cycles). 

Fig. 10. Successful batch-processed transactions 

Fig. 11. Successful transactions with human interaction 



6 Related Work, Conclusions, and Future Work 

So far, only a few papers deal with fine-grained CC in XML documents. DGLOCK [6] 
explores a path-oriented protocol for semantic locking on DataGuides. It is running in a 
layer on top of a commercial DBMS and can, therefore, not reach the fine granularity and 
flexibility of our approach. In particular, it cannot support ID-based access and position- 
based predicates. Another path-oriented protocol is proposed in [3, 4] which also seems 
to be limited as far as the full expressiveness of XPath predicates and direct jumps into 
subtrees are concerned. To our knowledge, the only competing approach which is also 
navigation oriented comes from the locking protocols designed for Natix [12]. They are 
also tailored to typical APIs for XDP. While the proposed lock modes are different to 
ours, the entire protocol behavior should be compared. Currently, we have the advantage 
that we do not need to simulate our protocols, but we can measure their performance on 
existing benchmarks and get real numbers. 

In this paper, we have primarily explored transaction isolation issues for collabo- 
rative XML document processing. We first sketched the design and implementation of 
our native XML database management system. For concurrent transaction processing, 
we have introduced our concepts enabling fine-granular concurrency control on taDOM 
trees representing our natively stored XML documents. As the key part, we have 
described the locking protocols for direct and navigational access to individual nodes of a 
taDOM tree, thereby supporting different isolation levels. The performance evaluation 
has revealed the locking overhead of our complex protocols, but, on the other hand, has 
confirmed the viability, effectiveness, and benefits of our approach. As a striking 
observation, lower isolation levels on XML documents do not necessarily guarantee better 
transaction throughput, because the potentially higher transaction parallelism may be 
(over-)compensated by higher lock management overhead. There are many other issues 
that wait to be resolved: For example, we did not say much about the usefulness of 
optimization features offered. Effective phantom control needs to be implemented and 
evaluated (thereby providing for isolation level serializable), based on the ideas we de- 
scribed. Then, we can start to systematically evaluate the huge parameter space available 
for collaborative XML processing (fan-out and depth of XML trees, mix of transactional 
operations, benchmarks for specific application domains, degree of application concur- 
rency, optimization of protocols, etc.). 
Abstract. A key step in the optimization of declarative queries over XML data is 
estimating the selectivity of path expressions, i.e., the number of elements reached 
by a specific navigation pattern through the XML data graph. Recent studies have 
introduced XSketch structural graph synopses as an effective, space-efficient 
tool for the compile-time estimation of complex path-expression selectivities over 
graph-structured, schema-less XML data. Briefly, XSketches exploit localized 
graph stability and well-founded statistical assumptions to accurately approximate 
the path and branching distribution in the underlying XML data graph. Empirical 
results have demonstrated the effectiveness of XSketch summaries over real-life 
and synthetic data sets, and for a variety of path-expression workloads. 

In this paper, we introduce fractional XSketches (fXSketches), a simple, yet 
intuitive and very effective generalization of the basic XSketch summarization 
mechanism. In a nutshell, our fXSketch synopsis extends the conventional notion 
of binary stability (employed in XSketches) with that of fractional stability, 
essentially recording more detailed path/branching distribution information on 
individual synopsis edges. As we demonstrate, this natural extension results in 
several key benefits over conventional XSketches, including (a) a simplified 
estimation framework, (b) reduced run-time complexity for the synopsis-construction 
algorithm, and (c) lifting the need for critical uniformity assumptions during 
estimation (thus resulting in more accurate estimates). Results from an extensive 
experimental study show that our fXSketch synopses yield significantly better 
selectivity estimates than conventional XSketches, especially in the context of 
complex path expressions with branching predicates. 



1 Introduction 

XML has rapidly evolved from a mark-up language to a de-facto standard for data ex- 
change and integration over the web. A testament to this is the increasing volume of 
published XML data, together with the concurrent development of XML query pro- 
cessors that will allow users to tap into the vast amount of XML data available on the 
Internet. The successful deployment of such query processors depends crucially on the 
existence of high-level declarative query languages. There exist numerous proposals that 
cover a wide range of paradigms, but a common characteristic among all XML-language 
proposals is the use of path expressions as the basic method to access and retrieve specific 
elements from the XML database. A path expression essentially defines a complex 
navigational path, which can be predicated on the existence of sibling paths or on constraints 
on the values of visited elements. As a concrete example, in a bibliography database, 
the path expression //author[book]/paper/sigmod/title (which adheres to the 
syntax of the standard XPath language [1]) selects the set of all title data elements 
discovered by the label path //author/paper/sigmod/title, but only for author 
elements that have at least one book child (a condition specified by the author[book] 
branch). 
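
As a small illustration of what the selectivity of such a path expression means, the following sketch evaluates the example expression on a tiny, made-up bibliography document using Python's standard xml.etree.ElementTree module. The document content is purely hypothetical, and the expression is written with a leading "." because ElementTree only accepts paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography: only the first author has both a book child
# and a title reachable via paper/sigmod.
DOC = """
<bib>
  <author><book/><paper><sigmod><title>A</title></sigmod></paper></author>
  <author><paper><sigmod><title>B</title></sigmod></paper></author>
  <author><book/><paper><vldb><title>C</title></vldb></paper></author>
</bib>
"""

root = ET.fromstring(DOC)
# [book] is the branching predicate: keep only authors with at least one book child.
titles = root.findall(".//author[book]/paper/sigmod/title")
selectivity = len(titles)   # number of elements reached by the path expression
```

Here the exact selectivity is obtained by evaluating the expression on the data itself. A synopsis aims to estimate this number at compile time without accessing the full document.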

Similar to relational optimization, optimizing XML queries with complex path ex- 
pressions depends crucially on the existence of concise summaries that can provide 
effective compile-time estimates for the selectivity of these expressions over the un- 
derlying (large) graph-structured XML database. This problem has recently attracted 
the attention of the database research community, and several techniques [2,3,4,5,6,7, 
8] have been proposed targeting different aspects of the problem. XSketch structural 
graph synopses [5,6] have recently been introduced as an effective data-reduction tool 
that enables accurate selectivity estimates for branching path expressions. In a nutshell, 
XSketch synopses exploit localized graph stability and well-founded statistical assump- 
tions to accurately approximate the path and branching distribution in the underlying 
XML data graph; furthermore, XSketches can be augmented with summary information
on data-value distributions to handle path expressions with value predicates [6].
Compared to previously proposed techniques, the XSketch synopsis mechanism targets 
the most general version of the estimation problem: XPath expressions with branching 
and value predicates, over graph-structured, schema-less XML databases. Experimental 
results with a variety of query workloads on different data sets have demonstrated the 
effectiveness of XSKETCHes as concise summaries of XML data. 

In this paper, we introduce fractional XSketches (fXSKETCHes), a simple yet
intuitive and very effective generalization of the basic XSketch synopses based on the
concept of fractional edge stabilities. Briefly, instead of simply recording whether a
synopsis edge is stable or not (i.e., the conventional “binary” notion of stability
employed in the XSketch model), our fXSKETCH synopses record the degree of stability
for each edge as a fraction between 0 (“no-connection”) and 1 (“fully stable”). As we
demonstrate, this natural generalization has a direct positive impact on the underlying 
estimation framework. First, it simplifies the expressions for query-selectivity estimates, 
thus allowing for faster estimation. Second, and perhaps most importantly, it lifts the 
need for certain critical uniformity assumptions during basic XSketch estimation, thus 
resulting in significantly more robust and accurate estimates. Furthermore, the removal of 
such uniformity assumptions also reduces the search space (and, therefore, the time com- 
plexity) of the synopsis-construction algorithm, since it effectively obviates the need for 
specialized synopsis-refinement operations to address regions of non-uniformity. These 
observations are backed up by an extensive experimental study which evaluates the per- 
formance of our generalized fXSKETCH synopses on a variety of XML data sets and query 
workloads. Our results clearly indicate that fXSKETCHes yield significant improvements
in accuracy when compared to original XSketch summaries. These improvements are 
more apparent in the case of complex path expressions with branching predicates, where 
the uniformity assumptions of the original XSketch model can introduce large errors; 
fractional stabilities, on the other hand, lift the need for such assumptions thus resulting
in significantly better fXSKETCH-based selectivity estimates.

    <DB>
      <Movies>
        <Movie ID="M1">
          <ActorRef IDREF="A1"/>
          <ActorRef IDREF="A2"/>
        </Movie>
        <Movie ID="M2"><ActorRef IDREF="A3"/></Movie>
      </Movies>
      <Actors>
        <Actor ID="A1"><MovieRef IDREF="M1"/></Actor>
        <Actor ID="A2"><MovieRef IDREF="M1"/></Actor>
        <Actor ID="A3">
          <Web Link="http://www.imdb.com/actor?A3"/>
          <MovieRef IDREF="M2"/>
        </Actor>
      </Actors>
    </DB>

(a) Example XML document    (b) XML data graph [graph drawing omitted]

Fig. 1. Example XML document (a) and XML data graph (b).

The remainder of this paper is organized as follows. Section 2 covers some prelimi- 
nary material on XML and path expressions, while Section 3 provides a short overview 
of the original XSketch model [5,6]. Our generalized fXSKETCH synopsis model is 
described in Section 4, where we discuss the definition of fractional stabilities and 
their implications on the estimation framework and the synopsis-construction process. 
Section 5 presents the results of our experimental study, while Section 6 gives some 
concluding remarks and our plans for future work. 

2 Preliminaries 

XML Data Model. Following previous work on XML and semistructured data [9,10], 
we model an XML database as a large, directed, node-labeled data graph $G = (V_G, E_G)$.
Each node in $V_G$ corresponds to an XML element in the database and is characterized by
a unique object identifier (oid) and a label (assigned from some alphabet of string literals)
that captures the semantics of the element. (We use label($v$) to denote the label of node
$v \in V_G$.) Edges in $E_G$ are used to capture both the element-subelement relationships
(i.e., element nesting) and the explicit element references (i.e., id/idref attributes or 
XLink constructs [11,12,9,13]). Note that non-tree edges, such as those implemented 
through id/idref constructs, are an essential component and a “first-class citizen” of 
XML data that can be directly queried in complex path expressions, such as those allowed 
by the XQuery standard specification [14]. We, therefore, focus on the most general case 
of XML data graphs (rather than just trees) in what follows. 

Example 1. Figures 1(a,b) show an example XML document and its corresponding data
graph. The document is modeled after the Internet Movie Database (IMDB) XML data set
(www.imdb.com), showing two movies and three actors. The graph node corresponding
to a data element is named with an abbreviation of the element’s label and a unique id 
number. Note that we use dashed lines to show graph edges that correspond to id-idref 
relationships. 
XPath Expressions. Abstractly, an XML path expression $l$ (e.g., in XQuery [14]) defines
a navigational path over the XML data graph, specifying conditions on the labels and
(possibly) the value(s) of data elements. Following the XPath standard [1], a simple
path expression is of the form $l_1/l_2/\ldots/l_n$, where the $l_i$'s are document labels. The
result of the path expression includes all elements $u_n$ for which there exists a document
path $u_1/u_2/\ldots/u_n$ with label($u_i$) = $l_i$. A branching path expression has the form
$l = l_1[\bar{l}^1]/\ldots/l_n[\bar{l}^n]$, where the $l_i$'s are labels and each $\bar{l}^i$ is a (possibly empty)
label path specifying a branching path predicate at location $i$. Thus, a branching path
expression is formed from a simple path expression $l_1/l_2/\ldots/l_n$ by attaching (one or
more) branching predicates $[\bar{l}^i]$ at specific nodes in the path. Each $[\bar{l}^i]$ clause represents
an existential condition, requiring that there exists at least one $\bar{l}^i$ label branch attached at
point $i$ of the expression. For example, consider the document graph of Figure 1 and the
simple path expression Actor/MovieRef/IDREF/Movie that retrieves elements with
ids 4 and 5; if we add a [Link] branch on Actor, then the new path expression
Actor[Link]/MovieRef/IDREF/Movie only retrieves element 4. Note that if all branch
predicates are empty, a branching path expression degenerates to a simple path expression.

It is possible to extend branching path expressions with predicates on the values of 
traversed elements. As an example, the XPath expression
Actor[/MovieRef/IDREF/Movie/Title="Snatch"] retrieves all actors that have starred in a movie with title
“Snatch”. In the interest of space, we ignore element values when we discuss the specifics
of our fXSKETCH synopses, focusing primarily on the label path and branching structure
of the underlying database; the necessary extensions to handle values and value
predicates follow along similar lines as the analogous extensions for basic XSketches [6].

3 A Review of XSketch Synopses 

Synopsis Model. The XSketch synopsis mechanism [5,6] relies on a generic
graph-summary model that captures the basic path structure of the input XML tree. Formally,
given a data tree $G = (V_G, E_G)$, a graph synopsis $S(G) = (V_S, E_S)$ is a directed
node-labeled graph, where (1) each node $v \in V_S$ corresponds to a subset of element (or,
attribute) nodes in $V_G$ (termed the extent of $v$, extent($v$)) that have the same label,
and (2) an edge $(u, v) \in E_G$ is represented in $E_S$ as an edge between the synopsis
nodes whose extents contain the two endpoints $u$ and $v$. Each synopsis node $u$ stores the
(common) label label($u$) of all elements in its extent, and an element-count field
$|u| = |\mathrm{extent}(u)|$ (we use $u$ and extent($u$) interchangeably in what follows). Figure 2(a)
shows a graph synopsis for the document of Figure 1, where elements are partitioned 
according to their label (synopsis nodes are named with the first letter of their label in 
upper case). 
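As an illustration of the graph-synopsis model, the following minimal sketch builds the label-partitioned synopsis for the element nodes of the Figure 1 document. The oids (m1, ar1, ...) and the function name label_split are assumptions made for this example; attribute nodes and id/idref edges are left out to keep it short.

```python
from collections import defaultdict

# Element nodes of the Figure 1(a) document; the oids are illustrative only.
labels = {
    "m1": "Movie", "m2": "Movie",
    "ar1": "ActorRef", "ar2": "ActorRef", "ar3": "ActorRef",
    "a1": "Actor", "a2": "Actor", "a3": "Actor",
    "mr1": "MovieRef", "mr2": "MovieRef", "mr3": "MovieRef",
}
# Parent-child (nesting) edges between those elements.
edges = [
    ("m1", "ar1"), ("m1", "ar2"), ("m2", "ar3"),
    ("a1", "mr1"), ("a2", "mr2"), ("a3", "mr3"),
]

def label_split(labels, edges):
    """One synopsis node per label; one synopsis edge per connected label pair."""
    extent = defaultdict(set)
    for oid, lab in labels.items():
        extent[lab].add(oid)
    syn_edges = {(labels[u], labels[v]) for u, v in edges}
    return dict(extent), syn_edges

extent, syn_edges = label_split(labels, edges)
counts = {lab: len(ext) for lab, ext in extent.items()}
print(counts)             # element-count field |u| of every synopsis node
print(sorted(syn_edges))  # synopsis edges
```

Each synopsis node stores only its label and element count; the extents are kept here purely to make the construction visible.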

XSketch synopses [5,6] are specific instantiations of the graph-synopsis model 
described above. In order to cover key properties of the path and branching distribution, 
the basic synopsis is augmented with edge labels that capture localized backward- and 
forward-stability [15] conditions across synopsis nodes. An edge $u \to v$ is Backward
(resp., Forward) stable, if all elements in the extent of $v$ (resp., $u$) have at least one parent
(resp., child) element in the extent of $u$ (resp., $v$). As an example, Figure 2(b) shows the
XSketch summary for the graph synopsis of Figure 2(a). Note that edge $A \to MR$ is both
backward and forward stable since all MovieRefs have an Actor parent, and all Actors
have at least one MovieRef child. As a result, |MR| = 3 is an accurate selectivity estimate
for path expression A/MR, while |A| = 3 is an accurate estimate for A[/MR]. As shown
in [5,6], such localized stability information can be combined to effectively capture key
properties of the global path structure, and provide accurate estimates on the selectivity
of XPath expressions over the original XML document graph.

Fig. 2. (a) Graph Synopsis, (b) XSketch Synopsis, (c) fXSKETCH Synopsis.
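The stability conditions can be checked directly from the extents of two synopsis nodes. The sketch below is a minimal illustration; the oids and the helper names b_stable and f_stable are assumptions, and the Actor/Web edge is added (from the Figure 1 document) to show an edge that is backward but not forward stable.

```python
# Extents of three synopsis nodes from the Figure 1 document (illustrative oids).
A = {"a1", "a2", "a3"}       # Actor
MR = {"mr1", "mr2", "mr3"}   # MovieRef
W = {"w1"}                   # Web (only the third actor has one)

# Parent-child edges between the elements above.
edges = [("a1", "mr1"), ("a2", "mr2"), ("a3", "mr3"), ("a3", "w1")]

def b_stable(u_ext, v_ext, edges):
    """True if every element of v has at least one parent in u."""
    children_of_u = {c for p, c in edges if p in u_ext}
    return v_ext <= children_of_u

def f_stable(u_ext, v_ext, edges):
    """True if every element of u has at least one child in v."""
    parents_of_v = {p for p, c in edges if c in v_ext}
    return u_ext <= parents_of_v

print(b_stable(A, MR, edges), f_stable(A, MR, edges))  # True True
print(b_stable(A, W, edges), f_stable(A, W, edges))    # True False
```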

Estimation Framework. The XSketch framework uses the concept of path
embeddings in order to estimate the selectivity of any branching path expression. In short,
a synopsis path $u_1/\ldots/u_k$ is called an embedding of the simple path expression
$l_1/\ldots/l_k$, if label($u_i$) = $l_i$ for each $1 \le i \le k$. Similarly, a synopsis twig
$u = u_1[\bar{u}^1]/\ldots/u_k[\bar{u}^k]$, where each $\bar{u}^i$ is a synopsis path, is called an embedding of the
branching path expression $l = l_1[\bar{l}^1]/\ldots/l_k[\bar{l}^k]$, if $u_1/\ldots/u_k$ is an embedding of
$l_1/\ldots/l_k$ and $\bar{u}^i$ is an embedding of $\bar{l}^i$, for each $1 \le i \le k$. The selectivity of an
XPath expression can be estimated as the sum of selectivities of its unique embeddings, 
and the problem is thus reduced to estimating the selectivity of a single embedding. 

To estimate the selectivity of a single embedding, the XSketch estimation algo- 
rithms identify sub-components comprising only stable edges (for which accurate esti- 
mates can be given based on edge stability), and then apply statistical (uniformity and 
independence) assumptions at the “breaking” points of the stability chain(s). A detailed 
description of the estimation framework can be found in [5,6]; here, we use a
simple example to illustrate the basic concepts. Consider the synopsis of Figure 2(b) and
the path embedding M/AR/IR[/A/MR]. We express the selectivity of the embedding as
$|IR| \cdot f(M/AR/IR[/A/MR])$, where the $f()$ term denotes the fraction of elements in
IR that are reached by the embedding. Using the Chain Rule from probability theory,
this fraction can be written as follows:

$$
\begin{aligned}
f(M/AR/IR[/A/MR]) &= f(M/AR) \cdot f(AR/IR \mid M/AR) \cdot f(IR[/A/MR] \mid M/AR/IR) \\
&= f(M/AR) \cdot f(AR/IR \mid M/AR) \cdot f(IR[/A] \mid M/AR/IR) \cdot f(A[/MR] \mid M/AR/IR[/A]).
\end{aligned}
$$

We observe that, by virtue of B-stability, the term $f(M/AR)$ is equal to 1 (all elements
in AR have a parent in M); similarly, by virtue of F-stability, the term
$f(A[/MR] \mid M/AR/IR[/A])$ is also equal to 1 (all elements in A have at least one child in MR).
The remaining two terms, however, cannot be computed based solely on available
stability annotations. To compensate for this lack of path-distribution information, the
XSketch estimation framework applies an independence assumption that de-correlates the
needed fractions from incoming or outgoing paths as follows:
$f(AR/IR \mid M/AR) \approx f(AR/IR)$, $f(IR[/A] \mid M/AR/IR) \approx f(IR[/A])$.
Thus, our selectivity expression is written as:

$$f(M/AR/IR[/A/MR]) \approx f(AR/IR) \cdot f(IR[/A]).$$

To approximate the required fractions, the XSketch estimation algorithm applies two 
uniformity assumptions based on the stored node counts. Since these assumptions are 
central in the theme of this paper, we define them formally below. 

A1 [Backward-Edge Uniformity Assumption]. Given an XSketch node $v$, the
incoming edges to $v$ from all parent nodes $u$ of $v$ such that $v$ is not B-stable with
respect to $u$ are uniformly distributed across all such parents in proportion to their
element counts.

A2 [Forward-Edge Uniformity Assumption]. Given an XSketch node $v$, the outgoing
edges from $v$ to all children $u$ of $v$ such that $v$ is not F-stable with respect to $u$ are
uniformly distributed across all such children in proportion to their element counts,
and the total number of such edges is at most equal to the total of these element
counts.

Returning to our example, Assumption A1 provides the approximation
$f(AR/IR) \approx |AR|/(|AR| + |MR|) = 3/6$, while Assumption A2 yields
$f(IR[/A]) \approx |A|/\max\{|A| + |M|, |IR|\} = 3/6$. Thus, the estimate of the fraction is 0.25, and
the overall path-expression selectivity can be approximated as $|IR| \cdot 0.25 = 1.5$.
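The arithmetic of this example can be replayed in a few lines. The sketch below is only a worked calculation: the element counts are assumptions read off the Figure 1 document and the fractions quoted above (two Movie elements, three ActorRef, Actor and MovieRef elements each, and six IDREF nodes).

```python
from fractions import Fraction

# Element counts of the synopsis nodes (assumed from Figure 1 and the text).
count = {"M": 2, "AR": 3, "IR": 6, "A": 3, "MR": 3}

# Assumption A1: IR receives edges from AR and MR in proportion to their counts.
f_ar_ir = Fraction(count["AR"], count["AR"] + count["MR"])

# Assumption A2: outgoing edges of IR are spread over its children A and M.
f_ir_a = Fraction(count["A"], max(count["A"] + count["M"], count["IR"]))

fraction = f_ar_ir * f_ir_a              # fraction of IR elements reached
estimate = count["IR"] * fraction        # estimated selectivity

print(f_ar_ir, f_ir_a, fraction)  # 1/2 1/2 1/4
print(float(estimate))            # 1.5
```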

Overall, the XSketch estimation framework uses stabilities in order to identify 
fully-stable subpath embeddings (for which accurate estimates can be given), and re- 
sorts to statistical assumptions to compensate for the lack of detailed path-distribution 
information in the synopsis. Clearly, the validity of these assumptions directly affects 
the accuracy of the resulting estimates. As we discuss next, the XSketch framework 
addresses this issue during the construction phase, where the goal is to find a “good” 
partitioning of elements into synopsis nodes such that the underlying estimation assump- 
tions are valid [5,6]. 

XSketch Construction. At an abstract level, the XSketch construction problem can be
defined as follows: Given a document graph $G$ and a space budget $S$, build an XSketch
synopsis of $G$ that requires at most $S$ units of storage, while minimizing the
estimation error. Given the specifics of the XSketch model, this can be re-stated as follows:
Compute a partitioning of data elements into synopsis nodes, such that the resulting
XSketch (a) needs at most $S$ units of storage, and (b) maximizes the validity of the
estimation assumptions, thus minimizing error. Given that finding an optimal solution
is typically an NP-hard problem [5], the proposed XSketch construction algorithm
(termed BuildXSketch [5]) employs a heuristics-based, greedy search strategy in or- 
der to construct an effective, concise summary. In what follows, we provide a brief 
description of the key concepts behind the BuildXSketch algorithm (more details can 
be found in [5]). 
BuildXSketch constructs an XSketch summary by incrementally refining a coarse
synopsis, until it exhausts the available space budget. The starting summary assigns 
elements to synopsis nodes based solely on their label and, thus, represents a very 
coarse partitioning of the input document graph. At each step, this coarse partitioning is 
refined by applying one of three types of refinement operations, namely b-stabilize, 
f-stabilize, and b-split, on a specific synopsis node. Such refinement operations 
split the synopsis node (according to a specific criterion), resulting in a more refined 
partitioning. The split criterion is directly tied to the estimation assumptions that the 
refinement targets. For instance, the b-stabilize operation splits the node so that one 
of the two new nodes becomes B-stable with respect to a particular parent; in this manner, 
B-stability is now guaranteed along the new edge and the new summary is expected 
to have better estimates in the refined area. To select one of the possible refinements 
at each step, the BuildXSketch algorithm employs a marginal-gains strategy: the
refinement that yields the largest increase in accuracy per unit of additional required 
storage is chosen. Intuitively, this strategy leads to a summary which is more refined 
where correlations are stronger (and, thus, estimation assumptions are less valid), and 
less refined where the independence and uniformity assumptions provide good estimates 
of the true selectivities. 
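The marginal-gains selection can be sketched as a simple greedy loop. This is not the BuildXSketch implementation: the Refinement record, its field names, and the fixed candidate list are assumptions made for illustration, whereas the actual algorithm generates and re-scores candidate refinements as the synopsis changes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Refinement:
    name: str          # hypothetical label, e.g. "b-stabilize"
    error_drop: float  # assumed reduction in estimation error
    extra_bytes: int   # additional storage required (must be > 0)

def greedy_refine(candidates, size, budget):
    """Repeatedly apply the affordable refinement with the best gain per byte."""
    chosen = []
    remaining = list(candidates)
    while remaining:
        affordable = [r for r in remaining if size + r.extra_bytes <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda r: r.error_drop / r.extra_bytes)
        chosen.append(best.name)
        size += best.extra_bytes
        remaining.remove(best)
    return chosen, size

candidates = [
    Refinement("b-stabilize", 0.10, 40),
    Refinement("f-stabilize", 0.06, 12),
    Refinement("b-split", 0.20, 200),
]
chosen, final_size = greedy_refine(candidates, size=100, budget=160)
print(chosen, final_size)  # ['f-stabilize', 'b-stabilize'] 152
```

In this toy run the b-split candidate has the largest absolute gain but never fits the budget, so the loop ends after the two cheaper refinements.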

4 The fXSKETCH Model 

As discussed in Section 3, the basic XSketch model employs the conventional, “binary” 
form of edge stabilities [15] in order to capture the key properties of the underlying path 
and branching structure. If stability is present (i.e., a value of 1), then there are strong 
guarantees on the connectivity between elements of edge-connected synopsis nodes; if, 
on the other hand, it is absent (i.e., a value of 0), then the XSketch estimation framework 
needs to apply independence and uniformity in order to approximate the true selectivity 
of the corresponding edge condition. In this section, we propose a simple, yet powerful 
generalization of the basic XSketch model based on a novel, more flexible notion of 
fractional stabilities. In a nutshell, instead of treating edge stability as a binary attribute, 
our new synopsis model uses a real number which reflects the degree of stability for each 
synopsis edge. In this manner, the synopsis stores distribution information at a finer level 
of detail, thus increasing the accuracy of the overall approximation. 

Before describing our proposed synopsis mechanism in more detail, we present a 
simple example that illustrates the motivation behind fractional stabilities. Consider the 
sample document of Figure 3(a), where the numbers along edges denote the numbers 
of child elements. Figure 3(b) shows the coarsest XSketch synopsis, which groups 
together elements according to their tags (for simplicity, we omit F-stabilities from the 
figure). Note that the basic XSketch estimation framework has to apply a
backward-edge uniformity assumption (A1) in order to estimate the number of f elements at the end
of a path expression. Considering the skew in element counts, however, this assumption
obviously introduces large errors in the estimate; as an example, the selectivity fraction
of the embedding B/F is estimated as $f(B/F) = 1/(1 + 1 + 1 + 1) = 0.25$, while
its true value is only 10“^! In order to capture such skew, the XSketch construction 
algorithm would have to apply successive stabilization operations, in order to separate 
the f elements according to their incoming path. This, however, would lead to a finer 
partitioning, thus inevitably increasing the storage requirements of the synopsis. In
addition, stabilization operations mainly target independence assumptions and it is not clear
if the greedy construction algorithm can actually discover their relation to an invalid 
uniformity assumption (in the sample document, for example, independence is in fact 
valid). Note that, although we focused our example on backward uniformity, a similar 
argument can be made for forward-edges as well (Assumption A2). 

As a solution to this shortcoming of the original XSketch model, we propose to 
store more detailed information along the edges of the graph synopsis in the form of 
fractional stabilities. More formally, these are defined as follows. 

Definition 1. Let $S(G) = (V_S, E_S)$ be a graph synopsis and $(u, v) \in E_S$ be a synopsis
edge. The fractional B-stability of $(u, v)$, denoted as $B_g(u, v)$, is the fraction of elements
in $v$ that have at least one parent in $u$. The fractional F-stability of $(u, v)$, denoted as
$F_g(u, v)$, is the fraction of elements in $u$ that have at least one child in $v$.

Clearly, our new model of fractional edge stabilities subsumes conventional, binary
stabilities. More specifically, an edge $(u, v)$ is B-stable (resp., F-stable) if and only if
$B_g(u, v) = 1$ (resp., $F_g(u, v) = 1$), and unstable otherwise. Moreover, it is interesting
to note that, by definition, fractional stabilities always provide zero-error estimates for
path embeddings of length 2. For instance, the selectivity of an embedding $u/v$ can be
accurately estimated as $|v| \cdot B_g(u, v)$, while for the length-2 branch $u[v]$, the selectivity
is simply equal to $|u| \cdot F_g(u, v)$. At this point, we can formally define our novel synopsis
model of fractional XSketches (fXSKETCHes) as follows.

Definition 2. An fXSKETCH summary $S(G) = (V_S, E_S)$ of a document graph $G$ is a
graph synopsis of $G$ that records the fractional stabilities $B_g(u, v)$ and $F_g(u, v)$ for each
edge $(u, v) \in E_S$.
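The two fractional stabilities follow directly from their definitions. The sketch below computes them for a small skewed document; the document (one B parent with a single F child, one C parent with nine F children) and the function names are hypothetical and are not the data of Figure 3.

```python
from fractions import Fraction

# Hypothetical skewed document: b has one F child, c has nine.
B_ext = {"b"}
C_ext = {"c"}
F_ext = {f"f{i}" for i in range(10)}
edges = [("b", "f0")] + [("c", f"f{i}") for i in range(1, 10)]

def fractional_b(u_ext, v_ext, edges):
    """Fraction of elements in v that have at least one parent in u."""
    with_parent = {c for p, c in edges if p in u_ext and c in v_ext}
    return Fraction(len(with_parent), len(v_ext))

def fractional_f(u_ext, v_ext, edges):
    """Fraction of elements in u that have at least one child in v."""
    with_child = {p for p, c in edges if p in u_ext and c in v_ext}
    return Fraction(len(with_child), len(u_ext))

b_bf = fractional_b(B_ext, F_ext, edges)
print(b_bf)                               # 1/10
print(fractional_b(C_ext, F_ext, edges))  # 9/10
print(fractional_f(B_ext, F_ext, edges))  # 1
print(len(F_ext) * b_bf)                  # 1: exact selectivity of B/F
```

For this toy input, proportional uniformity over the two single-element parents would give 1/2 for B/F, whereas the stored fractional B-stability yields the exact value 1/10.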
Fig. 3. (a) XML Document, (b) XSketch, (c) fXSKETCH.



Figure 3(c) depicts an example fXSKETCH for the document of Figure 3(a) (the edges
are annotated with fractional B-stabilities only, since all fractional F-stabilities are equal 
to 1). Obviously, assuming a fixed synopsis-graph structure, an fXSKETCH has increased 
storage requirements when compared against the corresponding XSketch: instead of 
storing only two bits per edge (to denote absence or presence of B/F-stability), each 
fXSKETCH edge needs to record two real numbers (i.e., the new fractional stabilities). 
This finer level of detail, however, allows a concise fXSKETCH summary to capture 
richer information in limited space, without resorting to a finer partitioning of elements
to synopsis nodes. Returning to the example of Figure 3, we observe that the fXSKETCH
contains accurate information for all paths of length up to 2, while maintaining the 
coarsest partitioning of elements to synopsis nodes; the XSketch, on the other hand, 
will exhibit high estimation errors due to an inaccurate uniformity assumption and, 
as mentioned earlier, can capture skew only by resorting to a much finer partitioning 
of elements. This intuitive advantage of fractional stabilities is corroborated by our 
experimental findings, where our fXSKETCH synopses consistently return significantly 
more accurate selectivity estimates for the same synopsis space budget (Section 5). 

We now discuss how our new fractional stabilities are integrated in the general 
XSketch framework for path-expression selectivity estimation. Once again, the key 
observation here is that fractional stabilities essentially provide zero-error estimates 
for the selectivities of single-edge path embeddings. More formally, if $(u, v)$ is any
edge in the fXSKETCH synopsis, then the selectivity fractions $f(u/v)$ and $f(u[v])$
are simply defined as $f(u/v) = B_g(u, v)$ and $f(u[v]) = F_g(u, v)$. Thus, instead
of applying uniformity assumptions to approximate such terms (possibly incurring
large estimation errors), our fXSKETCH estimation algorithm can retrieve the accurate
information directly from the stored fractional stabilities. As an example, consider
again the embedding M/AR/IR[/A/MR] over the synopsis of Figure 2(b). After
applying the Chain Rule and independence assumptions, the selectivity is expressed as
$f(M/AR/IR[/A/MR]) \approx f(AR/IR) \cdot f(IR[/A])$, where AR/IR and IR/A are the
non-stable edges of the embedding. An XSketch summary would now employ
Assumptions A1 and A2 to approximate the needed fraction terms; on the other hand, an
fXSKETCH makes use of the corresponding fractional stabilities, arriving at the result
$f(M/AR/IR[/A/MR]) \approx B_g(AR, IR) \cdot F_g(IR, A)$. Note that fractional stabilities are
at least as accurate as XSketch uniformity assumptions and, thus, assuming a fixed
synopsis graph, the fXSKETCH estimate is guaranteed to be at least as accurate as the
XSketch estimate (provided that independence is a valid assumption). Of course, for
a fixed space budget, the fXSKETCH synopsis graph is typically smaller (see discussion
above); nevertheless, our experimental results clearly show that, even in this case, our
notion of fractional stabilities is a consistent winner.
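Under the independence assumption, the estimate for an embedding reduces to a product of stored stabilities. The sketch below illustrates this for M/AR/IR[/A/MR]; the function name, its restriction to a single branch at the last step, and the stability values (derived by hand from the Figure 1 document, with stable edges set to 1) are assumptions for illustration only.

```python
from fractions import Fraction

# Assumed element count of the result node and stored fractional stabilities.
count = {"IR": 6}
B = {("M", "AR"): Fraction(1), ("AR", "IR"): Fraction(1, 2)}   # backward
F = {("IR", "A"): Fraction(1, 2), ("A", "MR"): Fraction(1)}    # forward

def estimate(main_path, branch, B, F, count):
    """Selectivity of main_path with one branch predicate on its last node.

    Main-path edges contribute their B-stability, branch edges their
    F-stability; independence lets the terms be multiplied.
    """
    fraction = Fraction(1)
    for u, v in zip(main_path, main_path[1:]):
        fraction *= B[(u, v)]
    branch_path = [main_path[-1]] + list(branch)
    for u, v in zip(branch_path, branch_path[1:]):
        fraction *= F[(u, v)]
    return count[main_path[-1]] * fraction

est = estimate(["M", "AR", "IR"], ["A", "MR"], B, F, count)
print(est)  # 3/2
```

No parent or child lists are scanned and no count sums are computed; each term is a single lookup.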

Overall, the use of fractional stabilities simplifies the estimation framework by lift- 
ing the assumptions on backward- and forward-edge uniformity. In essence, the only 
assumptions needed by our selectivity-estimation algorithm are basic independence as- 
sumptions that de-correlate a selectivity fraction from other parts of the path embedding. 
The advantage of not applying uniformity assumptions is two-fold. First, estimates be- 
come more accurate since the required selectivity fractions are stored explicitly as frac- 
tional stabilities and need not be estimated (with potentially large errors). Second, the 
estimation process becomes faster, as applying uniformity typically entails scanning the 
parent (or, child) nodes of a synopsis node and computing sums of element counts. 

The removal of the uniformity assumptions has a positive impact on the synopsis- 
construction algorithm as well. In essence, the BuildXSketch algorithm is simplified 
since it only needs to consider b-stabilize and f-stabilize operations; the b-split
refinement, which specifically targeted uniformity assumptions, becomes redundant and 
can be safely ignored. The end result is that the number of possible refinements per step 
is reduced, thus leading to faster synopsis-construction times. 
5 Experimental Study



In this section, we present the results of an extensive experimental study on the perfor- 
mance of the new fXSKETCH model. The goal of this study is two-fold: (a) to evaluate 
the effectiveness of fXSKETCHES as concise summaries for graph-structured XML data, 
and (b) to compare the accuracy of the new summarization model against the original 
XSketch framework. We have conducted experiments on real-life and synthetic XML 
data, using a variety of query workloads. Our key findings can be summarized as follows: 

- The fXSKETCH synopses are effective summaries of graph-structured XML data,
enabling accurate estimates for the selectivity of complex path expressions. The
experiments show that an XMark fXSKETCH synopsis achieves an average estimation
error of 0.8%, with storage requirements that amount to 0.1% of the data size.

- fXSKETCHes perform consistently better than XSketches in terms of estimation
error. More specifically, higher accuracy is obtained for the same synopsis size, and
a smaller size is achieved for the same accuracy. For instance, a 5KB fXSKETCH
synopsis of the XMark data set has a 10-fold improvement in accuracy when compared
to an XSketch synopsis of the same size. In addition, the fXSKETCH framework is
more robust with respect to workloads that contain numerous branching predicates.
For instance, the estimation error of a 5KB XSketch synopsis for the IMDB data
set can vary by up to 100% between path expressions with and without branching
predicates; on the other hand, an fXSKETCH of the same size has a variation of up
to 15%.

- fXSKETCHes compute relatively accurate estimates even for the coarsest synopsis.
With the XMark data set, the use of fractional stabilities in the coarsest summary
reduces the estimation error to 9% (compared to 27% for the coarsest XSketch
synopsis).

- fXSKETCHes have reduced requirements in terms of construction time. Given a
specific space budget, an fXSKETCH is both more efficient to construct and more accurate
compared to an XSketch. In addition, fXSKETCHes provide accurate estimates even
for low space budgets, thus leading to shorter construction times.

Overall, our experimental findings verify the effectiveness of the fXSKETCH
framework and demonstrate its benefits over the original XSketch proposal.



5.1 Experimental Methodology 

Implementation. We implemented a prototype of the proposed fXSKETCH framework 
over the existing XSketch code-base. Our implementation encodes fractional stabilities
in the standard float representation, so each non B- or F-stable edge contributes an
additional 4 bytes to the synopsis size. Note that XSketch and fXSKETCH synopses of
the same size differ in the size of their synopsis graphs: an fXSKETCH will always
be the smaller graph (fewer nodes and edges).

The parameters of the build algorithm were set to V=10% and P=200 for both 
fXSKETCH and XSketch construction (details on the construction algorithm can be 
found in the XSketch studies [5,6]). In the results that we report, the maximum size of
the computed synopses was limited to 2% of the underlying XML data size. 

Data Sets. We use two graph-structured data sets* in our experiments: IMDB, a real-life
data set from the Internet Movie Database (www.imdb.com), and XMark, a synthetic
data set that records the activities of an on-line auction site. Table 1 summarizes the 
main characteristics of the data sets in terms of the file size, and the sizes of the cor- 
responding perfect and coarsest synopsis. The coarsest summary, termed the label-split 
graph, partitions elements to nodes based solely on their label. The perfect summary, 
termed the B/F-Bisimilar graph, partitions elements so that all the edges of the resulting 
synopsis are B- and F-stable (i.e., their fractional stabilities are equal to 1). This property 
guarantees that the selectivity estimate for any branching path expression has zero error. 
Overall, both data sets have large perfect summaries thus motivating the need for concise 
synopses. Note that the sizes reported do not include the space needed to store the actual 
text of the element labels; each label is hashed to a unique integer and the mapping is 
stored in a separate structure that is not part of the summary. 



Table 1. Characteristics of the two data sets

                                IMDB       XMark
File Size                       3 MB       10 MB
Number of elements              102,755    206,131
Nodes in Label-Split Graph      123        84
Nodes in B/F-Bisimilar Graph    49,181     197,508
Size of Label-Split Graph       3.4 KB     2.1 KB
Size of B/F-Bisimilar Graph     1.1 MB     4.5 MB

Table 2. Average result sizes

                    IMDB    XMark
Simple              484     1125
Light-Branching     1351    1773
Heavy-Branching     1331    2420

Query Workload. We evaluate the accuracy of the generated summaries against three 
different workloads, each one consisting of 1000 path expressions: (a) Simple Paths, 
which contains simple path expressions only, (b) Heavy Branching Paths, in which 90% 
of path expressions have branching predicates and (c) Light Branching Paths, in which 
40% of path expressions have branching predicates. In each case, all path expressions are 
positive, i.e., they have non-zero selectivity, and are generated by sampling paths from 
the corresponding perfect synopsis. Except for branching predicates, which comprise
one or two steps, the length of the sampled paths is distributed between 2 and 5
and the sample is biased toward high counts in the perfect synopsis. As a result, the 
generated path expressions follow the distribution of the data, with high-count labels 
being referenced more frequently in the query set. Table 2 shows the average result size 
(in terms of the number of elements) for the path queries in each workload. 

We have also experimented with negative workloads, i.e., path expressions that do 
not discover any elements in the data graph. Our results have shown that both XSketch
and fXSKETCH summaries consistently produce close to zero estimates with negligible
error and therefore we omit this workload from our presentation.

* We use the same data sets as the original XSketch study [5].

Evaluation Metric. As in the original XSketch study [5], we quantify the accuracy
of both XSKETCHes and fXSKETCHes based on the average absolute relative error of
result estimates over path expressions in our workload. Given a path expression p with
true result size c, the absolute relative error of the estimated count e is computed as
$|e - c| / \max\{c, s\}$. Parameter s represents a sanity bound that essentially equates all
zero or low counts with a default count s and thus avoids inordinately high contributions
from low-count path expressions. We set this bound to the 10-percentile of the true counts
in the workload (i.e., 90% of the path expressions in the workload have a true result size
$\geq s$).

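The evaluation metric can be illustrated with a short Python sketch. This is not the
authors' code: the function names and the nearest-rank rule used to pick the
10-percentile are assumptions made for the example.

```python
def sanity_bound(true_counts, percentile=10):
    """Return the given percentile of the true counts.

    The nearest-rank rule used here is an assumption; the text only
    states that the bound is the 10-percentile of the true counts.
    """
    ordered = sorted(true_counts)
    # Integer ceiling of percentile * n / 100, at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def avg_abs_relative_error(estimates, true_counts, s):
    """Average of |e - c| / max(c, s) over all path expressions."""
    errors = [abs(e - c) / max(c, s) for e, c in zip(estimates, true_counts)]
    return sum(errors) / len(errors)
```

With the bound in place, a path expression whose true count is 2 and whose estimate
is 5 contributes 3 / s rather than 3 / 2 whenever s exceeds 2, which is how low-count
queries are kept from dominating the average.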




Fig. 4. XSKETCH and fXSKETCH estimation error for Heavy Branching Paths, plotted
against Synopsis Size (KB): (a) XMark Data Set, (b) IMDB Data Set



5.2 Experimental Results 

fXSKETCH Performance for Branching Paths. In this experiment, we evaluate the
performance of our fXSKETCH synopses for the heavy-branching paths workload. Figure 4
depicts the estimation error of fXSKETCHes and XSKETCHes as a function of the synopsis
size, for the IMDB and XMark data sets. Note that, in all the graphs that we present,
the estimation error at the smallest summary size corresponds to the label-split graph
synopsis (i.e., the coarsest summary). Overall, the results indicate that fXSKETCHes are
effective summaries that enable accurate estimates for the selectivity of branching path
expressions. In the case of XMark, for instance, the estimation error is reduced below
1% after a few refinements and it remains stable thereafter. For the more irregular IMDB
data set, the estimation error stabilizes at 7% for a space budget of 25KB, which
represents a very small fraction of the original data size. Compared to XSKETCHes, it is
evident that the new fXSKETCH synopses perform consistently better. For the IMDB data
set, for instance, a 25KB fXSKETCH has an average error of 7%, compared to 28% for
the XSKETCH of the same size, a 4-fold improvement. It is interesting to note that the
fXSKETCH-computed estimates are significantly more accurate even for the case of the
coarsest synopsis. For the XMark data set, for instance, the average error for the coarsest
fXSKETCH is 8% (2.5KB of storage), while the error for the coarsest XSKETCH is more
than 25% (2.1KB of storage).




Fig. 5. Estimation accuracy for different branching workloads, plotted against
Synopsis Size (KB): (a) fXSKETCHes, (b) XSKETCHes.

Workload Comparison. Figure 5 depicts the estimation error for XSKETCH and
fXSKETCH synopses respectively, for the two different types of branching workloads
(we note that the numbers for the Heavy Branching workload are identical to Figure 4).
We report the results for the IMDB data set only, as its structure is more irregular than that
of the XMark data set. The plot indicates that fXSKETCHes are more robust in terms
of their performance than XSKETCHes. As shown in Figure 5(b), the fXSKETCH error
follows a similar pattern in both types of workload, with the error of Heavy Branching
being slightly increased (as expected). XSKETCHes, on the other hand, exhibit
significantly different errors depending on the workload type (Figure 5(a)). The increased
error for Heavy Branching suggests that the forward-edge uniformity assumption is not valid
in the underlying data graph, thus leading to significant errors as the number of branches
increases. fXSKETCH estimation, on the other hand, relies on fractional stabilities in
order to capture accurately the distribution of document edges along each non-stable
synopsis edge; as a result, an fXSKETCH summary provides better approximations for
smaller budget sizes.




Fig. 6. XSKETCH and fXSKETCH estimation error for Simple Paths, plotted against
Synopsis Size (KB): (a) XMark Data Set, (b) IMDB Data Set

fXSKETCH Performance for Simple Paths. In this experiment, we evaluate the
performance of fXSKETCHes for simple path expressions. Figure 6 depicts the estimation error
of fXSKETCHes and XSKETCHes as a function of the synopsis size, for the IMDB and
XMark data sets. Similar to the previous experiment, fXSKETCHes provide more
accurate estimates compared to XSKETCHes. For the XMark data set, the estimation error for
fXSKETCHes stabilizes at 1.7% (5KB of storage), while for XSKETCHes it remains at
a considerably higher 7% (20KB of storage). The results follow a similar trend for the
IMDB data set, where the error for fXSKETCHes is reduced to 4% for the 20KB synopsis,
while the XSKETCH error stabilizes at 17% for the same space budget. In both data sets,
and in accordance with our findings in previous experiments, we observe an improvement
in accuracy for the coarsest summaries. The difference is more notable in the XMark
data set (12% vs. 66%) and comes at a very small increase in the storage requirements
of the summaries: 2450 bytes for the fXSKETCH vs. 2100 bytes for the XSKETCH.



6 Conclusions and Future Work 

Estimating the selectivity of complex path expressions is a key step in the optimization
of declarative queries over XML data. In this paper, we have proposed the fXSKETCH
model, a generalization of the original XSKETCH framework to the new concept of
fractional stabilities. Intuitively, fractional stabilities capture the degree of stability of
synopsis edges, and essentially free the estimation framework from the potentially
erroneous uniformity assumptions. The net result is a simplified estimation framework
that can provide more accurate estimates with less computation. Results from an
extensive experimental study have verified the effectiveness of the new model in providing
low-error selectivity estimates for complex path expressions and have demonstrated its
benefits over the original XSKETCH synopses.

In our future work, we plan to fine-tune certain aspects of the proposed framework.
More specifically, the current storage overhead of fractional stabilities can reach up to
30-45% of the total synopsis size, depending on the data set. We intend to investigate
techniques for reducing this overhead, by selectively choosing which fractional
stabilities to materialize. In essence, this would allow a hybrid model where both fractional
stabilities and the uniformity assumptions are used during estimation. A second
direction that we intend to explore is the incremental maintenance of fXSKETCH synopses in
the presence of data updates, or the refinement of summaries based on query feedback
(self-tuning synopses). We believe that fractional stabilities are a suitable model for such
techniques since they record distribution information at a finer level of detail and can
thus track more reliably the statistical characteristics of the underlying data.



Abstract. We investigate workload-directed physical data clustering in
native XML database and repository systems. We present a practical 
algorithm for clustering XML documents, called XC, which is based on 
Lukes’ tree partitioning algorithm. XC carefully approximates certain as- 
pects of Lukes’ algorithm so as to substantially reduce memory and time 
usage. XC can operate with varying degrees of precision, even in mem- 
ory constrained environments. Experimental results indicate that XC is 
a superior clustering algorithm in terms of partition quality, with only 
a slight overhead in performance when compared to a workload-directed 
depth-first scan and store scheme. We demonstrate that XC is substan- 
tially faster than the exact Lukes’ algorithm, with only a minimal loss in 
clustering quality. Results also indicate that XC can exploit application 
workload information to generate XML clustering solutions that lead to 
major reduction in page faults for the workload under consideration. 



1 Introduction 

Current database and repository systems use two main approaches for storing 
XML documents. The first approach maps an XML document into one or more 
relational tables [4,1]. Stored XML documents are then processed via traditional 
relational operators. The second approach, Native XML Storage, views the docu- 
ment as an XML tree. It partitions the XML tree into distinct records containing 
disjoint connected subtrees [8,3,11]. XML records are then stored in disk pages, 
either in an unparsed, textual form, or using some internal representation. In 
this paper, we concentrate on the native XML storage approach. 

Unlike relational databases where data is queried using value-dependent
select-project-join (SPJ) queries, XML document processing is dominated by
path-dependent navigational XPath queries. In practice, such navigational
traversals are often aided by path indices that reduce the number of path
navigations across stored XML records. However, index-only processing cannot
completely eliminate such path traversals as XML query patterns are often complex
and it is impractical to maintain multiple path indices to cover all possible paths
of the entire XML document. In reality, disk-resident XPath processors employ
a mixed, i.e., part navigational, part indexed, processing model.

Physical co-location or clustering of data items has been extensively used 
in database systems for exploiting common data access patterns to improve 
query performance. Data clustering was shown to be particularly effective when 
the underlying data model is hierarchical in nature and supports queries whose 
execution path traversals remain relatively stable (e.g., IBM’s IMS [10]). In 
native XML database systems, XPath operations result in navigations across 
stored XML records, which are similar to those in hierarchical or object-oriented 
databases. Clustering techniques could also be applied for storing XML docu- 
ments in native XML databases, where related XML nodes can be clustered and 
stored in the same disk page. Here, two XML nodes are related if they are con- 
nected via a link and examining one of them is likely to soon lead to examining 
the other. As disk pages have limited capacity, not all related XML nodes can fit 
in a single disk page. Therefore, one needs to decide how to assign related XML 
nodes to disk pages. This paper investigates the following XML clustering 
problem: Given a static XML document¹, a mixed XPath processing model, and
associated navigational workload information, identify related XML nodes and 
efficiently cluster them to disk pages. 

We formulate the XML clustering problem as a tree partitioning problem. 
The tree to be partitioned is a clustering tree, namely an XML tree augmented 
with node and edge weights. We then propose XC, a new tree partitioning algo- 
rithm that can operate with varying degrees of precision under space and time 
constraints. XC is based on LUKES, Lukes’ tree partitioning algorithm [9]. While 
LUKES is an exact algorithm, XC is an approximate algorithm. 

The main contribution of this work is a practical solution to the XML clus- 
tering problem using a tree partitioning approach which uses XML navigational 
behavior to direct its partitioning decisions. A key contribution of this work is 
XC, which has the following advantages over LUKES: 

- First, XC exploits intrinsic structural regularities in Lukes’ dynamic
  programming algorithm, namely, ready clusters (Section 3), to improve its
  memory consumption and running time, without affecting its precision. With this
  optimization, XC acts as an efficient version of the precise LUKES algorithm.

- Second, XC implements a parametric approximation of the dynamic
  programming procedure that allows it to trade off precision for time. This
  enables XC to exhibit linear-time behavior without significant degradation in
  quality over the precise solution.

- Third, under memory constraints, XC continuously adapts its precision on
  the fly by selectively eliminating dynamic programming options such that
  the memory limitations are not violated. This allows XC to trade off precision
  for memory.

The other significant contribution of the paper is the experimental validation
of XC using a prototype XML clustering system, XCS. We used XCS to compare
XC against an alternative XML partitioning scheme, workload-directed
depth-first scan & storage (WDFS). The experiments evaluate the quality of tree
partitioning and measure the number of page faults for various XPath query
workloads under different paging scenarios using the mixed XPath processing model
with prefix path indexes. Experimental results demonstrate that XC-computed
approximate partitions were close in value to the optimal partition computed
by the precise Lukes’ algorithm, at a fraction of the runtime costs. Similarly, XC
computed higher valued partitions than WDFS, with a slight overhead in
performance. When clustering using workloads consisting of XPath queries, XC was
able to compute a partition that matched the XPath navigation behavior. The
physical layout generated by XC caused fewer page faults, sometimes by more
than an order of magnitude, than WDFS. XC’s ability to use different degrees
of approximation in practically linear time, its ability to execute under severe
memory constraints and its ability to capture workload features, even in the
presence of indexes, make it a highly suitable candidate for enabling
workload-directed XML clustering for both online and offline scenarios.

¹ We don’t consider inter-document clustering.

The rest of the paper is organized as follows. The tree partitioning problem 
is formalized in Section 2. LUKES is reviewed in Section 2, and in Section 3, we 
present XC. The experiments are detailed in Section 4. Related work is summa- 
rized in Section 5. Section 6 presents conclusions and future work directions. 



2 The Tree Partitioning Problem 

Consider a rooted tree $T = (V, E)$, where V is a set of nodes and
$E \subseteq V \times V$ is a set of edges. A cluster over T is a non-empty subset
of V. When no confusion arises, we simply use the term cluster. A partition of T,
$P^T$, is a set of pairwise disjoint clusters over T whose union equals V, that is
$P^T = \{c_1, \ldots, c_k\}$, $k \geq 1$, such that $\bigcup_{i=1}^{k} c_i = V$, and
$c_i \cap c_j = \emptyset$, for all $i \neq j$. Each edge $(i, j)$ of T is associated
with a non-negative integer value, $v_{ij}$. The size of a cluster c, $size(c)$, is the
sum of the weights of its nodes; formally, $size(c) = \sum_{i \in c} w_i$, where $w_i$
denotes the weight of node i. The value of a cluster c, $value(c)$, is the sum of the
values of its edges; formally, $value(c) = \sum_{i, j \in c,\ (i, j) \in E} v_{ij}$.
The value of a partition $P^T$, $value(P^T)$, is the sum of the values of its
clusters; formally, $value(P^T) = \sum_{c \in P^T} value(c)$. Let W, the cluster
weight bound, be a positive integer. The tree partitioning problem is formulated
thus: Find a highest value partition, $P^T_{opt}$, among all the possible
partitions of T, such that the size of each cluster in $P^T_{opt}$ does not exceed W.
$P^T_{opt}$ is said to be an optimal partition of T.² So,
$P^T_{opt} = \{c_1, \ldots, c_k\}$ such that $size(c_i) \leq W$, for $i = 1, \ldots, k$, and
$value(P^T_{opt}) = \max\{value(P^T) \mid P^T \text{ is a partition of } T \text{ and } \forall c \in P^T,\ size(c) \leq W\}$.
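
The definitions above translate directly into a few helper functions. The following
Python sketch is only an illustration: the function names and the dictionary-based
encoding of node weights and edge values are assumptions, not part of the paper.

```python
def cluster_size(cluster, weight):
    """size(c): sum of the weights of the nodes in the cluster."""
    return sum(weight[n] for n in cluster)


def cluster_value(cluster, edge_value):
    """value(c): sum of the values of edges with both endpoints in the cluster."""
    return sum(v for (i, j), v in edge_value.items()
               if i in cluster and j in cluster)


def partition_value(partition, edge_value):
    """value(P): sum of the values of the clusters of the partition."""
    return sum(cluster_value(c, edge_value) for c in partition)


def is_feasible(partition, nodes, weight, W):
    """True if the clusters are non-empty, pairwise disjoint, cover all
    nodes, and no cluster exceeds the weight bound W."""
    covered = [n for c in partition for n in c]
    disjoint_cover = (len(covered) == len(set(covered))
                      and set(covered) == set(nodes))
    within_bound = all(len(c) > 0 and cluster_size(c, weight) <= W
                       for c in partition)
    return disjoint_cover and within_bound
```

An optimal partition is then any feasible partition that maximizes `partition_value`.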

Lukes’ Tree Partitioning Algorithm (LUKES): Lukes’ algorithm, LUKES,
addresses the tree partitioning problem. LUKES operates on a tree in a
bottom-up manner; it considers a node only after all the node’s children have been
considered. Consider a partition $P^{T'}$ of a subtree $T'$ (of T) rooted at node x.
The unique cluster in $P^{T'}$ which contains x is called the pivot cluster of
$P^{T'}$. The weight of a partition $P^{T'}$ is the size of its pivot cluster. LUKES
uses dynamic programming on the partition weights as follows: For each subtree, say
rooted at a node x, and for each feasible total cluster size U (i.e.,
$w_x \leq U \leq W$), LUKES constructs, if possible, an optimal subtree partition in
which the pivot cluster is of size U. So, LUKES associates with node x a set of
partitions, each optimal under the constraint that the pivot cluster size is U. When
considering a node x, the partitions that are associated with each child node of x
are used to update the collection of partitions, one per each feasible total cluster
size, for x. Once the tree root node is processed, the final result, $P^T_{opt}$, is
the highest value partition associated with the root; as Lukes showed, $P^T_{opt}$
has the maximum partition value among all possible partitions of the tree.

² In general, there may be more than one optimal partition.

LUKES: Lukes’ algorithm consists of the following steps:

1. For each leaf node u with weight $w = w_u$, form the partition $P_w$, with value
0, consisting of a single pivot cluster containing the node u ($P_w = \{\{u\}\}$).
Mark this partition as the optimal partition for u. For all internal non-leaf
nodes v, form similar initial partitions, each containing a single cluster with
value 0 and weight $w = w_v$.

2. Arbitrarily choose some node x with weight $w = w_x$, such that all of x’s
children are now leaf nodes. For the subtree rooted at node x, compute
for $w' = w, w + 1, \ldots, W$, the optimal partition, if one exists, in which the
cluster containing x (i.e., the pivot) has weight $w'$ (recall that W is the weight
bound). To find these optimal partitions, perform the following steps:

a) Let node i be the first child of node x.

b) For each partition P of node x and for each partition Q of node i,
generate intermediate partitions $I_1$ and $I_2$ as follows:

i. Let $c_x$ be the pivot cluster of P (containing x) and let $c_i$ be the pivot
cluster of Q (containing i).

ii. Merging: If $size(c_x) + size(c_i) \leq W$, then create a new cluster,
$c_m = c_x \cup c_i$, i.e., merge the two respective pivot clusters. Create a
new partition, $I_1 = \{c_m\} \cup (P - \{c_x\}) \cup (Q - \{c_i\})$, i.e., by
“gluing” $c_m$ and the remaining clusters from partitions P and Q.

iii. Concatenation: If Q is marked as the optimal partition for node i,
create a new partition $I_2 = P \cup Q$, i.e., concatenate clusters from
partitions P and Q to form a new partition. Else, $I_2 = \{\{\}\}$, with
value 0.

c) Update the intermediate partitions associated with node x. Let a (resp.,
b) be the weight of the cluster containing node x in $I_1$ (resp., in
$I_2$). If the current partition of weight a associated with x, $P_a$, is such
that $value(P_a) < value(I_1)$ then discard $P_a$ and replace it with $I_1$. Then,
if the current partition of weight b associated with x, $P_b$, is such that
$value(P_b) < value(I_2)$ then discard $P_b$ and replace it with $I_2$. If i is the
last child of node x, proceed to Step 3, otherwise, let i be the next child
of x and go to Step 2b.

3. Mark a partition, with the maximum value among the intermediate
partitions associated with node x, as the optimal partition, $P_{opt}$, for the subtree
with root x. Delete the children of node x from tree T. Node x is now a
leaf. If node x is the root then $P_{opt}$ is the computed optimal partition of T.
Otherwise, go to Step 2.

The inner loop (Step 2) is executed for every internal node in the tree. This
loop involves combining partitions of the internal node with the partitions of
its children (Step 2b). In the worst case, a parent may possess W intermediate
partitions after processing its first child (Step 2c); these W partitions can then
be combined with W partitions of every remaining child. For an n-node tree,
Step 2b can be invoked at most n - 1 times, leading to an $O(n W^2)$ running
time.
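
The dynamic program described in this section can be sketched in Python as follows.
This is an illustrative sketch, not the authors' implementation: the name `lukes`, the
dictionary-based tree encoding, and the recursive traversal are assumptions, and
every node weight is assumed to be at most W. For each node it keeps one best
partition per pivot-cluster weight and combines it with each child by merging and
by concatenation.

```python
def lukes(children, weight, value, root, W):
    """Return (best value, clusters) for the tree rooted at `root`.

    children: dict node -> list of child nodes (leaves may be absent)
    weight:   dict node -> positive integer node weight (assumed <= W)
    value:    dict (parent, child) -> non-negative integer edge value
    """
    def solve(x):
        # table: pivot weight -> (partition value, clusters);
        # clusters[0] is always the pivot cluster (the one containing x).
        table = {weight[x]: (0, [frozenset([x])])}
        for i in children.get(x, []):
            child = solve(i)
            # Optimal partition of the child's subtree, used for concatenation.
            best_v, best_clusters = max(child.values(), key=lambda t: t[0])
            new = {}

            def keep(w, cand):
                # Retain the highest-value partition for each pivot weight.
                if w not in new or new[w][0] < cand[0]:
                    new[w] = cand

            for a, (vp, P) in table.items():
                # Concatenation: pivot weight is unchanged.
                keep(a, (vp + best_v, P + best_clusters))
                # Merging: join the two pivot clusters if the bound allows it.
                for b, (vq, Q) in child.items():
                    if a + b <= W:
                        merged = [P[0] | Q[0]] + P[1:] + Q[1:]
                        keep(a + b, (vp + vq + value[(x, i)], merged))
            table = new
        return table

    return max(solve(root).values(), key=lambda t: t[0])
```

For example, for a root 0 with two unit-weight children 1 and 2, edge values 5 and 3,
and W = 2, the sketch keeps nodes 0 and 1 together and places node 2 in its own
cluster, for a partition value of 5.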

3 XC: An Approximate Tree Clustering Algorithm 

We now present a new approximate tree partitioning algorithm, XC, tailored for 
clustering large XML documents. Both LUKES and XC recursively partition a 
clustering tree. Recall that the clustering tree is the XML tree augmented with 
node and edge weights, both integers. Node weights are the (text) sizes of the 
XML nodes. Edge weights model the importance of putting the nodes into the 
same cluster. Once a subtree of the clustering tree is completely partitioned, we 
call it a processed subtree (and its root a processed node). XC provides the
following improvements over LUKES:

Identification of Ready Clusters: While partitioning a clustering tree,
LUKES creates new partitions by merging and concatenating clusters from
partitions of its processed subtrees. As a result, LUKES retains and propagates a
large number of partitions and clusters until the final successful partition is
computed, leading to excessive memory usage and runtime costs. XC addresses
this problem by aggressively detecting those clusters that are guaranteed to be
part of any final partition. Such clusters are called ready clusters. A cluster in a
partition associated with a processed node x is ready iff it satisfies the following
conditions:

1. It does not contain node x and hence it cannot be involved in any future 
cluster merge operations, and 

2. It is a member of every intermediate partition, including the optimal par- 
tition, associated with node x. 

According to condition (1), since the ready cluster is not a pivot cluster of
any intermediate partition of a processed subtree, it cannot be modified via the
merging operation (Step 2b:ii). Condition (2) guarantees that every partition
generated in the future from the intermediate partitions will contain the ready cluster.
Hence, the ready cluster can be safely removed from the intermediate partitions
of the processed subtree without affecting the correctness of the final solution.
XC detects ready clusters as soon as the subtree rooted at node x is processed
and forwards them (along with their XML data) for further processing (e.g., to the
page assignment sub-system, see Section 4.1, Figure 1). Once the page assignment
sub-system assigns a ready cluster to a page, the XML nodes contained in the
cluster can be deleted along with their XML data. This can lead to a significant
reduction in memory usage and consequently improves the running time. With
only this optimization, XC behaves like an optimized version of the exact LUKES.
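
The two conditions for a ready cluster amount to a set intersection over the
intermediate partitions of a processed node. The sketch below is a minimal
illustration under assumed names (`ready_clusters`, clusters as frozensets); it is not
taken from the XC implementation.

```python
def ready_clusters(x, partitions):
    """Return the clusters that are ready at processed node x.

    partitions: the intermediate partitions associated with x, each given
    as an iterable of frozensets of nodes. A cluster is ready if it does
    not contain x (condition 1) and appears in every intermediate
    partition (condition 2).
    """
    common = None
    for clusters in partitions:
        non_pivot = {c for c in clusters if x not in c}
        common = non_pivot if common is None else common & non_pivot
    return common if common is not None else set()
```

Clusters returned by this function could be handed to the page assignment step and
dropped from every intermediate partition of x.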

Approximate Dynamic Programming using Weight Intervals: Since
LUKES is an exact algorithm, it explores all possible relevant alternatives before
producing an optimal partition. For a tree with n nodes, for every node n', with
weight $w \geq 1$, LUKES examines all possible partitions with weights from w to the
weight bound, W. This results in $O(nW)$ space and $O(nW^2)$ time complexities.
To address this problem, XC partitions the $(1, W)$ weight range into equal sized
weight intervals and retains only one partition for each weight interval. The
number of the weight intervals is determined by the application through the chunk_size
parameter. The value of chunk_size can vary from 1 to W and is a divisor of W.
The chosen partition of each weight interval has the maximum value among the
partitions whose weights fall in that interval. So, for XC, following internal node
processing (LUKES, Step 2), there are at most $W_c = W / \mathit{chunk\_size}$
intermediate partitions associated with the node. Similarly, the inner loop (LUKES,
Step 2b) iterates over the weight intervals from 1 to $W_c$. These modifications
reduce the space complexity to $O(n W_c)$ and time complexity to $O(n W_c^2)$. This
weight interval approximation technique is used only for assigning intermediate
partitions in the proper interval. The decision whether to merge clusters (LUKES,
Step 2b) is still based on actual cluster weights (rather than the interval to which
the partition containing the cluster belongs). When chunk_size = 1, XC operates
exactly as LUKES; it also finds an optimal partition. When chunk_size = W,
only one intermediate partition is produced per each node. In other words, when
chunk_size = W, XC operates as a simple greedy algorithm. In general, by
changing the chunk_size, one can control the precision of XC’s approximate tree
partitioning algorithm.
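
The effect of weight intervals can be illustrated with a value-only variant of the
dynamic program. This is a sketch under stated assumptions, not the XC code: the
function name `xc_best_value`, the interval numbering, and the tie-breaking rule
(keep the first partition seen among equal values) are choices made for the example.
Partitions are bucketed by interval, while the merge test still uses actual weights.

```python
def xc_best_value(children, weight, value, root, W, chunk_size):
    """Best partition value found when one partition is kept per interval.

    chunk_size is assumed to divide W; node weights are assumed <= W.
    """
    def interval(w):
        # Weights 1..W map to intervals 0..(W // chunk_size) - 1.
        return (w - 1) // chunk_size

    def keep(table, v, w):
        # Retain the highest-value partition per interval, remembering its
        # actual pivot weight for later merge decisions.
        k = interval(w)
        if k not in table or table[k][0] < v:
            table[k] = (v, w)

    def solve(x):
        table = {interval(weight[x]): (0, weight[x])}
        for i in children.get(x, []):
            child = solve(i)
            best_child = max(v for v, _ in child.values())
            new = {}
            for vp, a in table.values():
                keep(new, vp + best_child, a)            # concatenation
                for vq, b in child.values():
                    if a + b <= W:                        # actual weights
                        keep(new, vp + vq + value[(x, i)], a + b)  # merging
            table = new
        return table

    return max(v for v, _ in solve(root).values())
```

With chunk_size = 1 the sketch keeps one partition per weight, as in the exact
algorithm; with chunk_size = W it keeps a single partition per node and can miss the
optimum, which is the greedy behavior described above.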

Handling Memory Constraints: In spite of the memory reduction techniques
employed by XC, its memory consumption can still be very large, particularly for
large XML documents and for smaller chunk_size values. We now describe the
technique employed by XC, which uses ready sub-partitions, to work effectively
subject to a memory limit. A ready sub-partition is a partition that is associated
with the root of an already processed subtree (of the clustering tree), which is a
subset of the computed best approximate partition for the whole clustering tree.
If one could identify a ready sub-partition, then the sub-partition’s clusters can
be forwarded for further processing and the memory of associated data structures
could be reclaimed immediately. Observe that the elimination of partitions also
reduces the number of options for future dynamic programming considerations,
further reducing the computation and memory usage. XC uses the following
mechanism to choose a partition as the next ready sub-partition. Once a subtree
is processed, its intermediate partitions are retained until its parent node is
processed. XC maintains a bounded-length (a parameter k, e.g., k = 8) list, HL,
of the current up to k highest value partitions that are associated with processed
nodes (i.e., roots of processed subtrees whose parents are not yet processed).
Given a memory limit M, XC computes high and low “water marks” for managing
memory usage (e.g., low = 0.7M < high = 0.9M < M). When memory usage
crosses the high water mark, corrective actions are triggered. Once it reaches the
low water mark, normal operation resumes. Upon crossing the high water mark,
XC chooses a partition, P, with the highest value from HL. XC then marks this
partition as a ready sub-partition, forwards its clusters to the page assignment
sub-system, and then reclaims the memory of the associated data structures.
Observe that P will also form a subset of the final result partition. XC then
removes the XML node $v_P$ with which P is associated (i.e., a root of a processed
subtree) and discards $v_P$’s link to its parent. This process of eliminating highest
value partitions from HL proceeds continuously until either the memory usage
falls below the low water mark or the optimal partition list, HL, is empty. If the
memory-usage violation persists and the optimal partition list is empty, then, as
soon as any subtree is processed, its optimal partition is immediately marked as
ready and is eliminated, in the process described above.
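
The water-mark mechanism can be sketched as a small eviction loop. The following
Python fragment is a simplified illustration with assumed names and an assumed
bookkeeping model (each HL entry records how many bytes its release would free);
it does not model the forwarding of clusters to the page assignment sub-system.

```python
def relieve_memory(usage, M, hl, low=0.7, high=0.9):
    """Evict highest-value partitions from HL once usage crosses the high mark.

    usage: current memory usage in bytes
    M:     memory limit in bytes
    hl:    list of (partition_value, node_id, bytes_held); modified in place
    Returns (nodes marked ready in eviction order, usage after eviction).
    """
    ready = []
    if usage <= high * M:
        return ready, usage            # normal operation, nothing to do
    hl.sort(key=lambda entry: entry[0])  # ascending, so pop() takes the highest
    while usage > low * M and hl:
        _, node, held = hl.pop()
        ready.append(node)             # partition becomes a ready sub-partition
        usage -= held                  # its data structures are reclaimed
    return ready, usage
```

If HL runs empty while usage is still above the low mark, the caller would fall back
to marking each newly processed subtree's optimal partition as ready, as described
in the text.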



4 Experimental Evaluation 

We ran XC with different degrees of precision and compared it against WDFS, 
a workload-directed depth-first scan and store scheme [12].³ WDFS is a natural
clustering algorithm that scans an XML document in a depth-first manner. In a
greedy fashion, it produces XML nodes and uses the workload information (i.e., 
the edge weights) to store nodes connected by edges with higher weights in the 
same cluster. The cluster is mapped into disk pages as soon as it is full. A major 
advantage of WDFS is that it places together XML nodes whose processing 
completes in temporal closeness. Another advantage of WDFS is that it is ideal 
for implementation in an online single-pass environment. 



4.1 XCS: A Prototype XML Clustering System 

For experimental purposes, we partitioned several XML documents from the 
University of Washington XML repository using XCS (Figure 1). XCS is a 
prototype XML clustering system that scans an XML document, partitions it 
into clusters and assigns the clusters to (disk) pages, all in a single pass. It 
consists of three distinct sub-systems: edge-weight assigner, tree partitioner and
page assigner.



Edge-weight Assigner: XCS’s edge-weight assigner uses application workload
information to assign weights to the clustering tree edges. Currently, the work- 
load information consists of a list of XPath queries and their relative weights. 

³ The interested reader is referred to [2] for complete experimental results.

Fig. 1. XCS Architecture.



These weights capture relevant workload features such as the frequency or
importance of specific XPath queries. XCS’s edge-weight assigner uses a
simulator that mimics execution of a hypothetical XPath processor using the mixed
XPath processing model. Currently, in addition to the traditional navigational
approach, the simulator also supports XPath processing using prefix path
indexes on a query’s sub-path.⁴ While executing an XPath query, the simulator
assigns the weight 0 to those edges covered by the path indexes (which mimics
the index-directed “jumping over” operation); the remaining edges, which are
used in navigation, are assigned the corresponding relative weight. Intuitively,
the assigned weight models the fact that the nodes connected by that edge are
traversed in (temporal) succession by the XPath processor, using that edge.
The edge weights represent cumulative weight increments generated by
traversals for the XPath queries in the workload. The higher the edge weight, the
higher the traversal affinity between the connected nodes.
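
The accumulation of edge weights can be sketched as follows. The sketch assumes
that the simulator has already produced, for each query, the list of tree edges it
touches and the subset of those edges covered by a path index; the function name
and the input format are hypothetical.

```python
from collections import defaultdict


def assign_edge_weights(workload):
    """Accumulate edge weights over a workload.

    workload: iterable of (traversed_edges, indexed_edges, relative_weight),
    where traversed_edges is a list of (parent, child) pairs touched by one
    query and indexed_edges is the set of those edges covered by an index.
    """
    weights = defaultdict(int)
    for traversed_edges, indexed_edges, relative_weight in workload:
        for edge in traversed_edges:
            # Index-covered edges are "jumped over" and contribute 0.
            increment = 0 if edge in indexed_edges else relative_weight
            weights[edge] += increment
    return dict(weights)
```

Edges that are navigated by many heavily weighted queries end up with the largest
cumulative weights, which is what later steers the partitioner toward keeping their
endpoints in the same cluster.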



Tree Partitioner and Page Assigner: Once the edge-weight assigner assigns 
the edge weights, the memory resident portion of the clustering tree is handled 
by the tree partitioning system (which can use either XC or WDFS). As soon 
as the tree partitioner identifies clusters suitable for storage, such clusters are 
forwarded to the page-assignment sub-system which maps them to pages. When 
XC is used as the tree partitioner, the page assignment sub-system uses an online 
bin-packing algorithm that makes page assignment decisions solely on the basis 
of weights of the currently available ready clusters (Section 3). WDFS uses
a greedy strategy to map clusters to pages.
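
The text does not say which online bin-packing heuristic the page assigner uses. As
one possible illustration, the sketch below applies the standard first-fit rule to
cluster weights as they arrive; the choice of first-fit, the function name, and the
assumption that no cluster exceeds the page capacity are all ours.

```python
def first_fit(cluster_weights, page_capacity):
    """Assign each arriving cluster to the first page with enough room.

    Returns a list with the page index chosen for each cluster, in
    arrival order. Each weight is assumed to be <= page_capacity.
    """
    free = []          # remaining capacity of each open page
    assignment = []
    for w in cluster_weights:
        for page, room in enumerate(free):
            if w <= room:
                free[page] -= w
                assignment.append(page)
                break
        else:
            # No open page can hold the cluster: open a new page.
            free.append(page_capacity - w)
            assignment.append(len(free) - 1)
    return assignment
```

Because the decision for each cluster depends only on the weights seen so far, the
rule fits the single-pass setting in which ready clusters are produced.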

4.2 Evaluation of XC 

We first investigate the effect of chunk_size on the quality (in terms of the parti- 
tion value) and performance of XC. Recall that XC uses dynamic programming 
over weight intervals rather than over the entire weight range (see Section 3). 
In XC, the weight interval is determined by the chunk_size parameter (which 
determines the number of dynamic programming intervals). Table 1 illustrates 
the effects of the chunk_size parameter on the final partition value and the total 

⁴ It could be easily extended to support generalized path indexes and text indexes.





212 



R. Bordawekar and O. Shmueli 



execution time. In this experiment, an XML file, xmark.xml, of size 110 KB 
and containing 2647 XML nodes, is clustered with a weight bound W = 1 KB. 
The chunk_size, c, is varied from 1024 to 1. Clustering tree edges are randomly 
assigned weights. As illustrated in Table 1, as chunk_size decreases, the value of 
the final partition solution increases. As the chunk_size decreases, the number 
of weight intervals used, $W/c$, increases and XC becomes more precise. However, 
as XC’s precision increases, the amount of computation grows (recall that the 
time complexity of XC is $O(n (W/c)^2)$), leading to an increase in memory usage 
and running time. When the chunk_size is 1, XC behaved like the exact LUKES, 
and the required running time was more than 7 hours. The difference in solution 
values for the chunk_sizes of 1024 and 1 is very small (< 2%), but the execution 
times varied drastically (432 ms for 1024 vs. more than 7 hours for 1). This 
indicates that XC’s strategy of dynamic programming over weight intervals, 
retaining the highest-valued partition per weight interval, works well in practice. 
As the results demonstrate, XC with a chunk_size of 1 (equivalently, LUKES) 
exhibits unrealistic memory and runtime requirements, whereas for larger chunk_sizes 
XC obtains good solutions with reasonable memory and runtime usage. 
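
The trade-off described above can be checked with a few lines of arithmetic. The sketch below is purely illustrative: it hard-codes the two extreme rows of Table 1 (chunk_size 1024 and chunk_size 1), and the variable names are ours rather than part of XC.

```python
# Two extreme rows of Table 1: chunk_size -> (partition value, time in ms).
table1 = {
    1024: (20244, 432),
    1: (20460, 28610786),
}

coarse_value, coarse_ms = table1[1024]
exact_value, exact_ms = table1[1]

# Relative loss in partition value when using the coarsest chunk_size.
value_loss_pct = 100.0 * (exact_value - coarse_value) / exact_value

# How many times longer the exact run (chunk_size = 1) takes.
slowdown = exact_ms / coarse_ms

# Running time of the exact run, expressed in hours.
exact_hours = exact_ms / (1000 * 60 * 60)

print(f"value loss: {value_loss_pct:.2f}%")
print(f"slowdown:   {slowdown:.0f}x")
print(f"exact run:  {exact_hours:.2f} hours")
```

The value loss comes out at roughly 1%, while the exact run is slower by a factor of more than 66,000 and takes just under 8 hours, which is consistent with the "< 2%" and "more than 7 hours" figures quoted in the text.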



Table 1. Effect of Dynamic Programming over Weight Intervals on XC using xmark.xml 
of size 110 KB. The weight bound was 1 KB and the chunk_size is varied from 1024 to 
1. When the chunk_size was 1, XC acted as an optimized version of LUKES. 



chunk_size    Value    Memory Usage (bytes)    Time (ms) 
1024          20244    49904                   432 
512           20355    62581                   423 
256           20430    85603                   590 
128           20442    116393                  1171 
64            20460    176740                  3040 
32            20460    289407                  10687 
16            20460    523074                  44974 
8             20460    1012078                 195832 
4             20460    1910695                 960775 
1 (LUKES)     20460    7625395                 28610786 



We measured the impact of identifying ready clusters on the quality of results. 
We clustered 5 XML documents with and without using the ready clusters. The 
edges were assigned random weights. We found that for all 5 documents, the 
memory usage increased when the ready clusters were not identified. In many 
cases, that resulted in a significant degradation in the overall running time. 
For example, while partitioning mondial.xml of size 1.97 MB, XC used 2227547 
bytes and needed 41.1 sec when ready clusters were identified; without the ready- 
cluster identification, the same document used 6059854 bytes and required 156.91 
sec. Ready cluster identification reduces the number of data structures that need 





Flexible Workload-Aware Clustering of XML Documents 



to be retained and propagated during the clustering process. It doesn’t reduce 
the number of iterations (which is determined by the chunk_size) but it reduces 
the amount of work per iteration as the amount of data to be handled decreases, 
and improves overall memory behavior by reducing memory usage and memory 
footprints of key data structures. It should be noted that for all cases, regardless 
of the ready cluster identification, the final solution value was the same. 

We also ran an experiment where we clustered an XML document using a 
clustering tree that didn’t store the XML data values. We found that the overall 
running time was similar to when the clustering tree stored the data values. As 
stated earlier, XC uses an incremental garbage collector to detect and remove 
dead objects quickly. This, along with the ready cluster identification, makes 
the clustering application computationally bound. Hence, once the ready cluster 
identification is used, the number of iterations has the most impact on the overall 
running time. 



Table 2. Evaluating Memory-Constrained Execution of XC using chunk_size of 4 KB 
on mondial.xml. As the available memory is reduced, maximum memory utilization 
and running time improve without significantly changing the final solution value. 



Memory Limit (bytes)    Memory Usage (bytes)    Value      Time (ms) 
∞ (no limit)            1425850                 8388200    32511 
1000000                 900576                  8353100    37305 
500000                  452095                  8207900    32070 
100000                  91264                   8138600    17794 
50000                   47247                   7996000    13164 
30000                   25815                   7817300    9049 




Table 2 presents the behavior of XC under various specified memory usage 
limits. We clustered an XML file, mondial.xml, of size 1.97 MB and containing 
154855 nodes, using random edge weights. We used the weight bound of 4 KB 
and ran the experiment, first using a chunk_size of 4096 and without any memory 
limits. We then repeated the experiment with smaller memory limits (we varied 
the amount of available memory from 1 MB to 30 KB). The results demonstrate 
that XC performs well even under extreme memory constraints. Furthermore, 
as the amount of available memory is reduced, XC degrades gracefully (values 
of the resultant partitions for the two extreme cases only differ from the best 
approximate solution by 6%). With tighter memory limits, execution time drops 
from 32.5 seconds to 9.049 seconds. 

The next experiment compares the effectiveness of XC with that of another 
tree partitioning algorithm, WDFS. We first describe clustering results obtained 
based on workload-based edge weight assignment. In this experiment, we consider 
the extreme case in which the workload consists of a single XPath query on 
mondial.xml: /mondial/country/province/city. We considered the execution of 
this query by a hypothetical XPath processor. The edges of the XML tree that would be 
traversed during such an execution were assigned weight 100. The document 
was then partitioned by WDFS and XC using various chunk_sizes. The results 
are illustrated in Table 3. In all cases, we see a clear advantage for XC over 
WDFS in terms of partition values. 

Table 3. Partitioning mondial.xml using workload-based edge weights. XC computes 
better partitions than WDFS, but at a higher runtime cost. 

Algorithm    chunk_size (bytes)    Value     Time (ms) 
WDFS         4096                  279000    11540 
XC           4096                  410500    15233 
XC           1024                  412800    17608 
XC           512                   415600    21124 
XC           256                   423700    30640 
XC           128                   423800    53818 
XC           64                    432900    132149 
XC           32                    423900    534455 
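
The workload-based weight assignment used in this experiment can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the hypothetical XPath processor follows only those parent-child edges whose child matches the next location step, it handles plain child steps only, and the default weight of 0 for untraversed edges as well as the function name `assign_edge_weights` are our own assumptions.

```python
import xml.etree.ElementTree as ET


def assign_edge_weights(root, steps, hit_weight=100, default_weight=0):
    """Map each (parent, child) edge of the tree to a weight.

    Edges on a root-to-node path that matches a prefix of `steps` get
    `hit_weight`; every other edge gets `default_weight` (an assumption).
    """
    weights = {}

    def visit(node, depth, on_path):
        # `depth` is the index in `steps` that a child of `node` must match.
        for child in node:
            child_on_path = (
                on_path and depth < len(steps) and child.tag == steps[depth]
            )
            weights[(node, child)] = hit_weight if child_on_path else default_weight
            visit(child, depth + 1, child_on_path)

    visit(root, 1, bool(steps) and root.tag == steps[0])
    return weights


# A tiny stand-in for mondial.xml (element names taken from the queries in the text).
doc = ET.fromstring(
    "<mondial>"
    "<country><name/><province><city/><city/></province></country>"
    "<organization><members/></organization>"
    "</mondial>"
)
weights = assign_edge_weights(doc, ["mondial", "country", "province", "city"])
heavy = sorted(child.tag for (_, child), w in weights.items() if w == 100)
print(heavy)
```

In this toy document only the edges leading to country, province, and the two city elements receive weight 100; the edges to name, organization, and members keep the default weight.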

4.3 Evaluating Using the Prototype Clustering System, XCS 

While the previous experiments examine XC’s usefulness as a clustering algo- 
rithm which produces high value partitions in reasonable space and time, this 
sub-section investigates the usefulness of the resulting clustering for applications. 
To this end, we tested how the resulting page layouts affect query processing. 

The testing was performed in the context of the prototype clustering system, 
XCS (Section 4.1). For these tests, we considered the following query workload 
on mondial . xml, consisting of four XPath queries: 

(Q1) /mondial/country/province[2]/city                        150 
(Q2) /mondial/country[//ethnicgroups]                         100 
(Q3) /mondial/country/province/city[longitude or latitude]    150 
(Q4) /mondial/organization/members                            100 

Queries Q1 and Q3 were assigned the weight 150, and queries Q2 and Q4 
were assigned the weight 100. Queries Q1 and Q2 were executed using a prefix 
path index on /mondial/country and the query Q3 was executed using another 
prefix path index on /mondial/country/province. We clustered mondial.xml 
in XCS using XC and WDFS as the tree partitioning algorithms. We first used 
the combined workload consisting of 4 queries and then repeated the experiments 
for individual workloads consisting of a single query. We used a weight bound of 
4 KB for all experiments and for XC, we used a chunk_size of 1 KB. 

We simulated the execution of each query using the mixed XPath processing 
model under different clustering scenarios. We generated traces of XML nodes 
traversed while executing the XPath queries with and without path indices. 
These traces, along with the page layouts for different clusterings, were used for 
evaluating the paging behavior. We modified two important parameters to vary 
the paging scenarios. The first, buffer size, is the number of disk pages that can be 
held in memory. The second, prefetch quantity, is the number of contiguous disk 
pages that are fetched in a single disk access (starting at the page addressed and 
including consecutive pages). The first scenario used a buffer size of 1 page with 
a 1 page prefetch (denoted by 1/1); the second scenario considered infinite buffer 
size (i.e., a fetched page is retained in memory for the duration of the execution) 
and 1 page prefetch (denoted by Inf/1); the final scenario simulated a realistic 
situation of using a buffer with 64 pages with a 2 page prefetch (denoted by 
64/2). This scenario used a simple LRU scheme for discarding in-memory pages. 
In all cases, we assumed a page size of 4 KB. 
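
The three paging scenarios can be illustrated with a small simulator. The sketch below is an assumption-laden illustration, not the simulator used in the experiments: it takes a trace of page numbers (one per node access, after mapping nodes to pages), counts one fault per disk access, and uses a plain LRU policy; the function name `count_page_faults` and its parameters are hypothetical.

```python
from collections import OrderedDict


def count_page_faults(trace, buffer_size=None, prefetch=1):
    """Count disk accesses for a trace of page numbers.

    buffer_size=None models the infinite buffer (Inf/1 scenario).
    On a miss, `prefetch` consecutive pages starting at the requested
    page are brought in with a single disk access.
    """
    buffer = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in buffer:
            buffer.move_to_end(page)  # refresh recency on a hit
            continue
        faults += 1
        # Insert prefetched pages first and the requested page last,
        # so the requested page is the most recently used one.
        for p in range(page + prefetch - 1, page - 1, -1):
            buffer[p] = True
            buffer.move_to_end(p)
            if buffer_size is not None and len(buffer) > buffer_size:
                buffer.popitem(last=False)  # evict the least recently used page
    return faults


trace = [0, 1, 0, 2, 0, 1]
faults_1_1 = count_page_faults(trace, buffer_size=1, prefetch=1)
faults_inf_1 = count_page_faults(trace, buffer_size=None, prefetch=1)
faults_64_2 = count_page_faults(trace, buffer_size=64, prefetch=2)
print(faults_1_1, faults_inf_1, faults_64_2)
```

A clustering that places nodes traversed together on the same or adjacent pages shortens the list of distinct pages in the trace, which is what reduces the fault counts in all three scenarios.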



[Figure 2 panels: "Clustering mondial.xml (WDFS and XC)"; "Evaluating Individual Queries, 3 Paging Scenarios"] 

Fig. 2. Page faults for WDFS- and XC-based clustering for combined and individual 
query workloads. The weight bound was 4 KB and the chunk_size was 1 KB. Execution 
of queries Q1, Q2, and Q3 used prefix path indexes. Query Q4 used navigation over 
parent-child edges. 



Figure 2 presents the page fault measurements for WDFS- and XC-based 
clustering using combined (namely, the clustering was based on all four queries) 
and individual (namely, the clustering was based on a single query) workloads. As 
demonstrated by the results, in all cases, XC caused fewer page faults than WDFS. 
While WDFS was able to exploit local parent-child traversal affinities, XC was 
able to capture and exploit traversal affinities over paths of the abstract XML tree. 
Although the best paging performance was observed when the document was 
clustered solely based on the single workload query, the paging performance of 
individual queries didn’t degrade substantially when the document was clustered 
using the combined query workload. The biggest gain was observed for the query 
Q2, which was assigned the weight 100. When the combined workload was used, 
XC generated a clustering that favored the queries Q1 and Q3, which had heavier 
weights. In contrast, query Q4 exhibited no improvement when the individual 
query workload was used. Query Q4 did not traverse the same region as the 
remaining three queries and hence, it solely determined the partitioning of its 
region, even under the combined workload. This result demonstrated that XC 
was able to compute a final solution that matched both the combined traversal 
pattern and individual traversal patterns. 



[Figure 3 panels: (a) Impact of XC chunk_size on Values; (b) Impact of XC chunk_size on Performance] 

Fig. 3. Impact of XC’s chunk_size on the final partition value (a) and the overall 
execution time (b). mondial.xml was partitioned using the combined and individual 
workloads. The weight bound was 4 KB. Execution of queries Q1, Q2, and Q3 used 
prefix path indexes. Query Q4 used navigation over parent-child edges. 



We also studied the impact of XC’s precision on the clustering behavior. We 
clustered mondial.xml using the combined workload with the weight bound 
W = 4 KB, and reran the experiment by varying the chunk_size from 4096 to 
32. We found that 32 was the smallest chunk_size for mondial.xml for which XC 
could compute the final solution in a reasonable amount of time. Figures 3(a) 
and (b) illustrate the impact of XC’s chunk_size on the final partition value 
and the overall execution time. As Figure 3(a) shows, apart from the combined 
and individual Q4 workloads, the chunk_size had no effect on the final partition 
value. The query Q4, /mondial/organization/members, traverses a region of 
the XML tree that is closer to the root. Since XC is a bottom-up algorithm, 
subtrees closer to the root are processed in the later stages of the algorithm. 
Hence, a more precise version of XC is needed to exploit local affinities in those 
subtrees that are processed later in the algorithm. As shown in Figure 3(a), 
the difference in values corresponding to the maximum and minimum chunk_sizes 
(i.e., 4096 and 32) was extremely small (< 1%). However, as the chunk_size was 
reduced, XC’s running time increased by an order of magnitude (Figure 3(b)). 
These results illustrate that for smaller chunk_sizes (i.e., at higher precision), XC 
can compute a better final partition (as shown for the query Q4), but at higher 
precision XC always requires longer running times. These results also indicate 
that clustering at higher precision may not always lead to an improvement in 
the number of page faults. 

5 Related Work 

Determining if there exists a tree partition with a value at least v is NP-complete 
in the ordinary sense (Problem ND15, acyclic partitioning) [5]. Johnson and 
Niemi propose two algorithms for this problem [7]. Both exhibit higher running 
times than LUKES. 

Since the early days of database development, physical clustering of related data 
items has been examined for exploiting data access patterns. An early 
reported application of clustering for physical database design was in hierarchical 
databases [10]. Object-oriented databases (OODBs) have used clustering of 
logically related objects for achieving better physical object placement [12,6]. Object 
clustering in OODBs has not been widely accepted, primarily due to the 
complexities of object-oriented programming (e.g., dynamically changing object access 
patterns) and problems with effectively partitioning the resultant object graph. 
OODBs traditionally use graph partitioning algorithms for partitioning a 
clustering graph whose node weights represent object sizes and edge weights denote 
access behavior. Such algorithms exhibit large space and time requirements. 
Furthermore, these graph partitioning techniques are not suitable for partitioning 
trees as they can’t exploit structural aspects of trees for making partitioning 
decisions. Also, these algorithms assume the entire graph to be generated before 
partitioning and need multiple in-memory passes over the generated graph. 
Interestingly, [6] uses Lukes’ tree partitioning algorithm when the clustering graph 
is a tree and concludes that it is not useful in real applications due to its large 
memory usage. 

In [3], Fiebig et al. describe Natix, a native XML system that splits XML 
documents into multiple subtrees (records) such that each subtree record can fit into 
a page. Their splitting algorithm uses a split matrix to determine which nodes 
should always be stored together. We believe a schema-agnostic, workload- 
directed clustering algorithm like XC can be used to generate tree partitions, and 
these partitions could then be integrated into a Natix-like record structure. 



6 Conclusions and Future Work 

In this work, we presented XC, a new approximate tree partitioning algorithm 
that is tailored for clustering static XML documents. We showed that XC 
improves on Lukes’ tree partitioning algorithm. XC exhibits running time that is 
linear in the size of the input file, and it can even operate within a memory limit. 
XC can be used in varying degrees of precision; higher precision could potentially 
compute better solutions, although with higher memory and runtime costs. 
Extensive experimental evaluation demonstrated that workload-directed clustering 
of XML documents works well in practice. Furthermore, clustering of XML 
documents using XC is superior to clustering based on WDFS, an alternative 
workload-directed clustering scheme. This validates our formulation of the XML 
clustering problem as a tree partitioning problem. XC was able to capture and 
exploit XML navigational workload affinities over paths of the abstract XML tree. 
Our current work involves extending these techniques for clustering XML 
documents in scenarios where either the workload is changing dynamically or the 
XML document is structurally updated. We also plan to investigate extensions to 
XC to exploit affinities across sibling edges. 



Abstract. XML indices are essential for efficiently processing XML queries 
which typically have predicates on both structures and values. Since the num- 
ber of all possible structural and value indices is large even for a small XML 
document with a simple structure, XML DBMSs must carefully choose which 
indices to build. In this paper, we propose a tool, called XIST, that can be used by 
an XML DBMS as an index selection tool. XIST exploits XML structural infor- 
mation, data statistics, and query workload to select the most beneficial indices. 
XIST employs a technique that organizes paths that evaluate to the same result 
into structure equivalence groups and uses this concept to reduce the number 
of paths considered as candidates for indexing. XIST selects a set of candidate 
paths and evaluates the benefit of an index for each candidate path on the basis of 
performance gains for non-update queries and penalty for update queries. XIST 
also recognizes that an index on a path can influence the benefit of an index on 
another path and accounts for such index interactions. We present an experimental 
evaluation of XIST and current XML index selection techniques, and show that the 
indices selected by XIST result in greater overall improvements in query response 
times. 



1 Introduction 

An XML document is usually modeled as a directed graph in which each edge repre- 
sents a parent-child relationship and each node corresponds to an element or an attribute. 
XML processing often involves navigating this graph hierarchy using regular path ex- 
pressions and selecting those nodes that satisfy certain conditions. A naive exhaustive 
traversal of the entire XML data to evaluate path expressions is expensive, particularly in 
large documents. Structural join algorithms [1,8,27] can improve the evaluation of path 
expressions, but as in the relational world, join evaluation consumes a large portion of 
the query evaluation time. Indices on XML data can provide a significant performance 
improvement for path expressions and predicates that match the index. However, an 
index degrades the performance of update operations and requires additional disk space. 
As a result, determining which set of indices to build is a critical administrative task. 
These considerations for building indices are not new, and have been investigated 
extensively for relational databases [24,3]. However, index selection for XML databases 
is more complex due to the flexibility of XML data and the complexity of its structure. 
The XML model mixes structural tags and values inside data. This extends the domain 
of indexing to the combination of tag names and element content values. In contrast, 
relational database systems mostly consider only attribute value domains for indexing. 
Moreover, the natural emergence of path expression query languages, such as XPath, 
further suggests the need for path indices. Path indices have been proposed in the past for 
object-oriented databases and even as join indices [23] in relational databases. Unlike 
relational and object-oriented databases, XML data does not require a schema. Even 
when XML documents have associated schemas, the schemas can be complex. An XML 
schema can specify optional, repetitive, recursive elements, and complex element hier- 
archies with variable depths. In addition, a path expression can also be constrained by 
the content values of different elements on the path. Hence, selecting indices to speed 
up the evaluation of XML path expressions is challenging. 

This paper describes XIST, a prototype XML index selection tool that uses an inte- 
grated cost/benefit-driven approach. The cost models used in this paper are developed 
for a prototype native XML DBMS that we are building. As in other native XML sys- 
tems, this system stores XML data as individual nodes [27,17,12,7,14], and also uses 
stack-based join algorithms [1,8,2] for evaluating path expressions. However, the gen- 
eral framework of XIST can be adapted to systems with other cost models by modifying 
the cost equations that are presented in this paper. 



1.1 Contributions 

Our work makes the following contributions: 

- We propose a cost-benefit model for computing the effectiveness of a set of XML 
indices. In this cost-benefit analysis, we account for the index update costs and also 
consider the interaction effect of an index on the benefit of other indices. By carefully 
reasoning about index interactions, we can eliminate redundant computations in 
the index selection tool. 

- When the XML schema is available, XIST uses a concept of structure equivalence 
groups, which results in a dramatic reduction in the number of candidate indices. 

- We develop a flexible tool that can recommend candidate indices even when only 
some input sources are available. In particular, the availability of only either the 
schema or the user workload is sufficient for the tool. 

- Our experimental results indicate that XIST can produce index recommendations 
that are more efficient than those suggested by current index selection techniques. 
Moreover, the quality of the indices selected by XIST increases as more information 
and/or more disk space is available. 

The remainder of this paper is organized as follows. Section 2 presents data models, 
assumptions, and terminologies used in this work. In Section 3, we describe the overview 
of the XIST algorithm. Sections 4, 5, and 6 describe the individual components of XIST in 
detail. Experimental results are presented in Section 7, and the related work is described 
in Section 8. Finally, Section 9 contains our concluding remarks and directions for future 
work. 

[Figure 1: schema graph of the sample bibliography database. Recoverable node labels: article (2), book (3), title, first, publisher (9); edges are marked as required or optional.] 

Fig. 1. Sample XML Schema 



2 Background 

In this section, we describe the XML data models, terminologies, and assumptions that 
we use in this paper. 

2.1 Models of XML Data, Schema, and Queries 

We model XML data as a tree. We encode the nodes of the tree using Dietz’s numbering 
scheme [11,10]. Each node is labeled with a pair of numbers representing its positions 
on preorder and postorder traversals. Using Dietz’s proposition, node x is an ancestor 
of node y if and only if the preorder number of node x is less than that of node y and the 
postorder number of node x is greater than that of node y. This basic numbering scheme 
can be extended to include an additional number that encodes the level of the node in 
the XML data tree. This numbering scheme is used in our cost models when performing 
the structural joins [27,8,1] between parents and children or between ancestors and 
descendants. We need the additional level information of a node to differentiate between 
parent-child and ancestor-descendant relationships. 
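
The numbering scheme and the two structural tests can be sketched in a few lines. This is an illustrative sketch only: the `Node` class, the function names, and the choice of numbering from 1 with the root at level 0 are our assumptions, not details taken from the cost models.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    pre: int = 0    # position in a preorder traversal
    post: int = 0   # position in a postorder traversal
    level: int = 0  # depth in the tree (root = 0, an assumption)


def number_tree(root):
    """Assign preorder, postorder, and level numbers in one depth-first pass."""
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        counters["pre"] += 1
        node.pre = counters["pre"]
        node.level = level
        for child in node.children:
            visit(child, level + 1)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root, 0)


def is_ancestor(x, y):
    """Dietz's test: x precedes y in preorder and follows y in postorder."""
    return x.pre < y.pre and x.post > y.post


def is_parent(x, y):
    """Parent-child needs the extra level number on top of the ancestor test."""
    return is_ancestor(x, y) and y.level == x.level + 1


# bib -> book -> (title, author -> first)
title, first = Node("title"), Node("first")
author = Node("author", [first])
book = Node("book", [title, author])
bib = Node("bib", [book])
number_tree(bib)
```

With this numbering, `is_ancestor(bib, first)` holds while `is_parent(bib, first)` does not, which is exactly the distinction the level number is needed for.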

We model an XML schema as a directed labeled graph. An XML schema is written 
in the W3C XML Schema Definition Language, which describes and constrains the 
content of XML documents. It allows the definition of groups of elements and attributes. 
We can group elements sequentially, choose some elements to appear and others to 
be absent, or define an unordered set of elements. Each edge in the 
XML Schema graph connects a parent element to a child element, where 
child elements are grouped sequentially. A required edge from node A to node B means 
that the minimum number of occurrences of B inside node A is at least one; 
an optional edge means that the minimum number of occurrences of 
B inside node A is zero. Figure 1 shows the schema of a sample bibliography 
database that we use as a running example throughout this paper. 

2.2 Terminologies and Assumptions 

We now define terminologies for paths and path indices that are used in this paper. 

A label path p (also referred to as a path) is a sequence of labels $l_1/l_2/\ldots/l_k$, where 
the length of the path is k. We assume that the returned result of path p is the ending 
node of path p. Path $p_d$ is dependent on path p if p is a subpath of $p_d$. For example, path 
$l_1/l_2/\ldots/l_k$ is dependent on path $l_1/l_2$. 

A path index (PI) on path p is an index that has p as a key and returns the node IDs 
(NIDs) of the nodes that matched p. (As discussed in Section 2.1, an NID is simply a 
triplet encoding the begin, end, and level information.) 

An element index (EI) is a special type of path index. Since an element “path” 
consists of only one node, the element index stores only the NIDs of the nodes matched 
by the element name. 

A candidate path (CP) is a path that XIST considers as a candidate for building 
an index. The corresponding index on the candidate path is called a candidate path index 
(CPI). 

In our work, we consider the following types of indices as candidates: (i) structural 
indices on individual elements, (ii) structural indices on simple paths as defined above, 
and (iii) value indices on the content of elements and attribute values. It is possible to 
extend our models to include other types of indices, such as an index on a twig query, 
in the future. 



3 The XIST Algorithm 

In making its recommendations for a set of indices, XIST is designed to work flexibly 
with the availability of a schema, a workload, and data characteristics of an XML data 
set. Figure 2 shows an overview of XIST, which consists of four modules that adapt 
to the given input configuration. The first module is the candidate path selection 
module, which eliminates a large number of potentially irrelevant path indices. It uses the 
following two techniques: (i) if the query workload is available, this module eliminates 
paths that are not in the query workload, and (ii) if the schema is available, the tool 
identifies and prunes equivalent paths that can be evaluated using a common index. 

To compute the benefits of indices on candidate paths, we use either the cost-based 
benefit computation module or the heuristic-based benefit computation module, 
depending on the availability of data statistics. When data statistics are available, the cost-based 
benefit computation module is employed. When data statistics are not available, the 
heuristic-based benefit computation module is used instead. (The computation 
carried out by the benefit computation modules may be present in many optimizers, and 
these modules could be shared by the query optimizer.) 

The last module is configuration enumeration, which in each iteration chooses an 
index from the candidate index set that yields the maximum benefit. The configuration 
enumeration module continues selecting indices until a space constraint, such as a limit 
on the available disk space, is met. 

Figure 3 presents the overview of the XIST algorithm. In the following sections, we 
describe in detail the steps shown in this figure. The first phase, the selection of candidate 
paths (CPs), is presented in Section 4. Section 5 discusses the benefit computations for 
the indices on the selected CPs (CPIs). Finally, Section 6 describes the re-computation 
for the benefits of CPIs that have not been chosen (line 9). 
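
The configuration enumeration phase described above amounts to a greedy loop. The sketch below illustrates that loop under stated assumptions: the space constraint k is taken to be a number of indices (the paper's default is disk space), each index is identified by its path, and `benefit` is a caller-supplied stand-in for the benefit models of Section 5. The name `xist_select` and the toy benefit function are hypothetical.

```python
def xist_select(candidate_paths, element_indices, benefit, k):
    """Greedily pick indices until k indices are selected.

    benefit(path, selected) returns the benefit of indexing `path`
    given the already selected set; it is re-evaluated in every
    iteration, which models the recomputation step of the algorithm.
    """
    selected = set(element_indices)          # start from the element indices
    remaining = set(candidate_paths) - selected
    while len(selected) < k and remaining:
        benefits = {p: benefit(p, selected) for p in remaining}
        # sorted() makes tie-breaking deterministic in this sketch
        best = max(sorted(remaining), key=benefits.get)
        remaining.remove(best)
        selected.add(best)
    return selected


# Toy benefit model with one index interaction (made-up numbers).
BASE = {"book/title": 10, "bib/book/title": 9, "author/first": 5}


def toy_benefit(path, selected):
    # Once book/title is indexed, an index on bib/book/title adds little.
    if path == "bib/book/title" and "book/title" in selected:
        return 1
    return BASE[path]


chosen = xist_select(BASE.keys(), {"book", "title"}, toy_benefit, k=4)
print(sorted(chosen))
```

In the toy run, book/title is chosen first; the recomputation then lowers the benefit of bib/book/title, so author/first is chosen next. Without the recomputation the second pick would have been the largely redundant bib/book/title index.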




XIST: An XML Index Selection Tool 







Fig. 2. The XIST Architecture 

4 Candidate Path Selection 

In this section, we address the important issue of selecting candidate paths (CPs). Since 
the total number of candidate paths for an XML schema instance can be very large, for 
efficiency purposes, it is desirable to identify a subset of the candidate paths that can 
be safely dropped from consideration without reducing the effectiveness of the index 
selection tool. The candidate path selection module in XIST employs a novel technique 
to achieve this goal. 

Our strategy for reducing the number of CPs is to share an index among multiple 
paths. Our approach for grouping similar paths involves identifying paths that share the 
same ending nodes. Since the ending nodes of these paths are the same, an index on any 
single path in the group returns the same set of nodes as the other paths in the group 
(assuming that the returned nodes of a path are only the ending nodes of the path). 

Definition 1. A path $p_u$, $n_1/n_2/\ldots/n_k$, is a unique path if there is one and only one 
incoming required edge to $n_1$ and there is one and only one incoming required edge to 
node $n_i$, which must be exclusively from node $n_{i-1}$, for i = 2 to k. 

Definition 2. Path $p_1$ and path $p_2$ are in the same structure equivalence group if 1) $p_1$ 
and $p_2$ share the same suffix subpath, 2) the starting node of the shared suffix subpath 
has only one incoming edge, and 3) the non-suffix subpath is a unique path. 

We refer to a group of paths that share the same ending nodes as a Structure 
Equivalence Group (SEG). As an example of an SEG, consider the schema shown in 
Figure 1. A sample SEG in this schema is the set containing the following paths: 
bib/book/publisher, book/publisher, and publisher. For brevity, we refer to 
the paths using the concatenation of the first letter of each element on the path (for 
example, we refer to bib/book/publisher as bbp). As shown in the schema graph in 
Figure 1, bbp is a unique path as p has only one parent, and each of its ancestors also has 
only one parent. Nodes that match bbp are the same as nodes that match the suffix paths 
of bbp, which are bp and p. Thus, these paths are in the same SEG. However, suppose 
that publisher has two parents: book and article. Then bib/book/publisher, 
book/publisher, and publisher do not form an SEG, since we cannot assume that 
publisher is the publisher of the book (book/publisher). The publisher can also 
be the publisher of the article (article/publisher). On the other hand, book/publisher 
and /bib/book/publisher form an SEG. In this SEG, the shared suffix subpath is 
book/publisher. 

Inputs: An index space constraint k (which can be the available disk space [default], 
or the number of indices), an XML schema, a query workload, and optional data statistics. 
The algorithm requires at least the schema or the workload. 
Output: A set of recommended path indices, S 

XIST() 
// Phase 1: Candidate Path Selection 
1. use the XML schema or the query workload to compute the target workload W. 
2. choose paths and subpaths in the target workload W to form the set of candidate paths (CPs). 
// Phase 2: Benefit Computation 
3. for each CP p, compute the benefit of the corresponding CPI, I_p. The benefit is B(I_p). 
// Phase 3: Configuration Enumeration 
4. initialize the set of selected indices, S, to I_E, where I_E is the set of element indices. 
5. while (|S| < k) 
6.     select p ∈ CP with I_p ∉ S such that B(I_p) is the maximum. 
7.     CP = CP - {p} 
8.     S = S ∪ {I_p} 
9.     ∀p ∈ CP, recompute the benefits of candidate path indices, B(I_p) 
10. endwhile 

Fig. 3. The XIST Algorithm 

Instead of building indices on each path in an SEG, XIST builds an index only on the 
shortest path in each SEG. We choose the shortest path because doing so can often 
reduce the space and access time of the index: the shortest path can simply 
be a single element. In such an index, we only need to store three integers (begin, end, 
level) per index entry, whereas indices on longer paths require storing six integers per 
index entry. 
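
The grouping of paths into SEGs and the choice of the shortest path per group can be sketched as follows. This is a simplified reading of the definitions, offered as an assumption rather than XIST's actual procedure: it assumes that element labels are unique in the schema graph, it ignores the distinction between required and optional edges, and the `parents` dictionary (label to set of parent labels) as well as both function names are hypothetical.

```python
from collections import defaultdict


def shortest_equivalent_path(path, parents):
    """Drop leading labels as long as the remaining suffix matches the same nodes.

    The first label can be dropped when the label that follows it has
    exactly one parent in the schema graph, namely that first label.
    """
    labels = path.strip("/").split("/")
    while len(labels) > 1:
        head, nxt = labels[0], labels[1]
        if parents.get(nxt, set()) != {head}:
            break
        labels = labels[1:]
    return "/".join(labels)


def group_into_segs(paths, parents):
    """Group paths by their shortest equivalent path (one index per group)."""
    groups = defaultdict(set)
    for p in paths:
        groups[shortest_equivalent_path(p, parents)].add(p)
    return dict(groups)


paths = ["bib/book/publisher", "book/publisher", "publisher"]

# Schema as in the running example: publisher appears only under book.
single_parent = {"article": {"bib"}, "book": {"bib"}, "publisher": {"book"}}
segs_single = group_into_segs(paths, single_parent)

# Variant discussed in the text: publisher under both book and article.
two_parents = {"article": {"bib"}, "book": {"bib"}, "publisher": {"book", "article"}}
segs_two = group_into_segs(paths, two_parents)

print(segs_single)
print(segs_two)
```

In the first schema all three paths fall into one group keyed by publisher; in the variant, bib/book/publisher and book/publisher share the key book/publisher while publisher stays on its own, matching the example in the text.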

Since structure equivalence groups are determined based only on the XML schema, 
these groups are valid for all XML documents conforming to the XML schema. The 
SEGs cannot be determined by using data statistics because statistics do not indicate 
whether a node is contained in only one element type. Since some elements in XML 
data can be optional, they may not appear in XML document instances and thus may not 
appear in data statistics as well. 



5 Index Benefit Computation 

In this section, we describe the internal benefit models used by the XIST algorithm 
to compute the benefits of candidate path indices (CPIs). The total benefit of an index 
$I_p$, $B(I_p)$, is computed as the sum of: (i) $B_E$, which is the benefit of using $I_p$ for 
answering queries on the equivalent paths of $p$ (recall that all paths in an EQ share the 
same path index), and (ii) $B_D$, which is the benefit of using $I_p$ for answering queries on 
the dependent paths of $p$. Figure 4 presents an algorithm for computing the total benefit 
of $I_p$, $B(I_p)$. 

Inputs: A set of existing indices S, a target workload W, and a CPI I_p on path p 
Output: The benefit of I_p, B(I_p) 

ComputeIndexBenefit() 
    // F_E and F_D are functions for benefit computation 
 1. B_E = 0 
 2. for path p_e in EQ(p) and p_e in W 
 3.     b_e = F_E(p, p_e, S) 
 4.     B_E = B_E + b_e 
 5. endfor 
 6. B_D = 0 
 7. for path p_d with p as a subpath and p_d in W 
 8.     b_d = F_D(p, p_d, S) 
 9.     B_D = B_D + b_d 
10. endfor 
11. if data statistics are available 
12.     B(I_p) = B_E + B_D - U(I_p) 
13. else 
14.     B(I_p) = B_E + B_D 

Fig. 4. The Index Benefit Computation Algorithm for CPI I_p 
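
The algorithm of Figure 4 can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the benefit functions `f_e` and `f_d` and the optional `update_cost` are passed in as callables, paths are slash-separated strings, and the sketch assumes that dependent paths exclude the paths already counted as equivalent.

```python
def is_subpath(p, q):
    """True if the steps of path p occur contiguously inside path q."""
    ps, qs = p.split("/"), q.split("/")
    return any(qs[i:i + len(ps)] == ps for i in range(len(qs) - len(ps) + 1))


def compute_index_benefit(p, eq_paths, workload, existing, f_e, f_d,
                          update_cost=None):
    """Total benefit B(I_p) of a candidate index on path p.

    eq_paths    -- paths in the equivalence group of p (including p)
    workload    -- paths queried by the workload W
    existing    -- the set S of already chosen indices
    f_e, f_d    -- benefit functions for equivalent and dependent paths
    update_cost -- callable giving U(I_p); None when no statistics exist
    """
    # Lines 1-5: benefit on equivalent paths that occur in the workload.
    b_e = sum(f_e(p, pe, existing) for pe in eq_paths if pe in workload)
    # Lines 6-10: benefit on workload paths that contain p as a subpath.
    b_d = sum(f_d(p, pd, existing)
              for pd in workload
              if pd not in eq_paths and is_subpath(p, pd))
    benefit = b_e + b_d
    # Lines 11-14: subtract the update cost only when statistics exist.
    if update_cost is not None:
        benefit -= update_cost(p)
    return benefit
```

Passing the benefit functions as parameters lets the same loop serve both the cost-based and the heuristic-based variants described in the following subsections.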

5.1 Cost-Based Benefit Computation 

When data statistics are available, XIST can estimate the cost of evaluating paths more 
accurately. The collected data statistics consist of a sequence of tuples, each representing 
a path expression and the cardinality of the node set that matches the path (also called 
the cardinality of a path expression). XIST uses the path cardinality to predict path 
evaluation costs. In reality, however, these costs depend largely on the native storage 
implementation and the optimizer features of an XML engine. To address this issue, we 
approximate the path evaluation costs via abstract cost models based on our experimental 
native XML DBMS. 



Computing Evaluation Costs. The cost of evaluating a path with the index on the 
path is estimated to be proportional to the cardinality of the path since the path index 
is implemented using a hash index and the hash index access cost is proportional to the 
number of items retrieved from the hash index. Let $C(p_1/p_2, S \cup I_{p_1/p_2})$ be the cost of 
evaluating $p_1/p_2$. Then, 

$C(p_1/p_2, S \cup I_{p_1/p_2}) \approx K_I \times |p_1/p_2|$    (1) 

where $K_I$ is a constant and $|p_1/p_2|$ is the cardinality of the nodes matched by $p_1/p_2$. 
If an index on a path does not exist, XIST splits the path into two subpaths and then 
recursively evaluates them. When splitting the path, XIST needs to determine the join 
order of subpaths to minimize the join cost. The chosen pair has the minimal sum of 
the cardinalities of subpaths. Subpaths are recursively split until they can be answered 
using existing indices. After subpaths are evaluated, their results are recursively joined 
to answer the entire path. Finding an optimal join order is not the focus of this paper; 
techniques for it have recently been proposed [26]. 

When XIST joins two indexed subpaths of a path, it uses a structural join algorithm [1], 
which guarantees that the worst-case join cost is linear in the sum of the sizes of 
the sorted input lists and the final result list. Let $S$ be the set of indices, which excludes 
the index on $p_1/p_2$, and let $C(p_1/p_2, S)$ be the cost of joining path $p_1$ and path 
$p_2$. Then, 

$C(p_1/p_2, S) \approx K_J \times (|p_1| + |p_2| + |p_1/p_2|)$    (2) 

where $K_J$ is a constant and $|p_i|$ is the estimated cardinality of the nodes that match 
$p_i$. The estimated cardinalities of the nodes that match the paths are given as an input to 
the XIST tool (by the XML estimation module in the system). 
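
The interplay of Equations (1) and (2) with the recursive splitting rule can be sketched as below. This is a rough illustration under stated assumptions rather than the XIST cost module: paths are tuples of element names, `card` is a hypothetical dictionary of estimated cardinalities for every subpath involved, the constants are set to 1, and the sketch assumes that the cost of producing an unindexed input is simply added to the join cost of Equation (2).

```python
K_I = 1.0  # assumed constant for index lookups, Eq. (1)
K_J = 1.0  # assumed constant for structural joins, Eq. (2)


def estimate_cost(steps, indexed, card):
    """Estimate the cost of evaluating the path given by `steps`.

    steps   -- tuple of element names, e.g. ("book", "publisher")
    indexed -- set of step tuples for which a path index exists
    card    -- dict mapping step tuples to estimated cardinalities
    """
    if steps in indexed:
        return K_I * card[steps]                        # Eq. (1)
    if len(steps) == 1:
        raise ValueError(f"no index covers element {steps[0]!r}")
    # Split where the two subpaths have the smallest total cardinality.
    k = min(range(1, len(steps)),
            key=lambda i: card[steps[:i]] + card[steps[i:]])
    left, right = steps[:k], steps[k:]
    cost = K_J * (card[left] + card[right] + card[steps])   # Eq. (2)
    # Indexed inputs are already covered by Eq. (2); unindexed inputs
    # must first be produced by further joins (an assumption of this sketch).
    for part in (left, right):
        if part not in indexed:
            cost += estimate_cost(part, indexed, card)
    return cost
```

For a two-step path whose steps are both indexed, the sketch reduces exactly to Equation (2); once the path itself is indexed, it reduces to Equation (1).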

Since the maintenance cost for an index can be very expensive, XIST also considers 
the maintenance cost in the index benefit computation. The actual cost of updating a 
path index depends heavily on the system implementation details, and different 
systems are likely to have different costs for index updates. In this paper, for simplicity, 
we use an update cost model in which the update cost for a given path index is proportional 
to the number of entries being updated in the path index. (This cost model can be 
adapted in a fairly straightforward manner if the cost needs to include a log-based factor, 
which is characteristic of tree-based indices.) 

Let $U(I_{p_1/p_2})$ be the cost of updating the index on path $p_1/p_2$. Then, 

$U(I_{p_1/p_2}) \approx K_U \times |p_1/p_2|$    (3) 

where $K_U$ is a constant and $|p_1/p_2|$ is the cardinality of the nodes that match $p_1/p_2$. 



Using Cost Models for Computing Benefits. Now we describe how the cost models 
are used to compute the total benefit of an index when data statistics are available. 
The benefit function $F_E(p, p_e, S)$ computes the benefit of using $I_p$ to 
completely evaluate a path $p_e$ in the structure equivalence group of $p$, assuming the 
set of indices $S$ exists. The benefit function $F_D(p, p_d, S)$ computes the 
benefit of using $I_p$ to partially answer a dependent path $p_d$ of $p$, assuming the set of 
indices $S$ exists. 

$F_E(p, p_e, S) = C(p_e, S) - C(p, S \cup I_p)$ 

$F_D(p, p_d, S) = C(p_d, S) - C(p_d, S \cup I_p)$ 

Fig. 5. $F_E$ and $F_D$ for $I_p$ (with Statistics) 
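
As a worked illustration of Figure 5 together with Equations (1)-(3), consider a candidate path $p$ = book/publisher evaluated against itself ($p_e = p$), with $S$ containing only the element indices on book and publisher. The cardinalities and the unit constants below are made-up values chosen to keep the arithmetic simple; they do not come from the experiments. The example also assumes that no dependent path of $p$ occurs in the workload ($B_D = 0$) and that every entry of the index is updated.

```latex
\begin{align*}
&\text{Assumed: } |book| = 1000,\quad |publisher| = 1200,\quad
  |book/publisher| = 1000,\quad K_I = K_J = K_U = 1 \\
C(p, S) &\approx K_J \times (|book| + |publisher| + |book/publisher|) = 3200 \\
C(p, S \cup I_p) &\approx K_I \times |book/publisher| = 1000 \\
F_E(p, p, S) &= C(p, S) - C(p, S \cup I_p) = 3200 - 1000 = 2200 \\
U(I_p) &\approx K_U \times |book/publisher| = 1000 \\
B(I_p) &= B_E + B_D - U(I_p) = 2200 + 0 - 1000 = 1200
\end{align*}
```

Under these assumed numbers the index remains beneficial even after its update cost is subtracted; a larger $K_U$ would shrink or cancel that benefit.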

5.2 Heuristic-Based Benefit Computation 

When data statistics are not available, XIST estimates the benefit of the index by using 
the lengths of queries and the length of the candidate path (CP). The benefit of a candidate 
path index (CPI) is estimated based on: a) the number of joins required to answer queries 
with and without the CPI, and b) the use of a CPI to completely or partially evaluate a 
query. 

In the following paragraphs, we use the following notation: $p$ is a CP, $I_p$ is a CPI, $p_e$ 
is an equivalent path of $p$ (a path that $I_p$ can completely answer), and $p_d$ is a dependent 
path of $p$ (a path that $I_p$ can partially answer). 

We first consider the benefit of $I_p$ when it can completely answer a query. This 
benefit is computed by the $F_E$ function, which estimates the number of joins needed as 
the length of the shortest unindexed subpath of $p$. Let $L(p)$ be the length of path $p$, $S$ be 
the set of existing indices, and $L'(p, S)$ be the length of the shortest unindexed subpath 
in $p$. $L'(p, S)$ is the difference between the length of $p$ and that of the longest indexed 
subpath of $p$. 

Next, we consider the benefit of $I_p$ when it can partially answer a query. This benefit 
is computed by the $F_D$ function. Like $F_E$, $F_D$ estimates the number of joins needed to 
answer the query. However, in this case, the number of joins needed is more than just 
the length of an unindexed subpath of the query. The closer the length of $p$ is to the length 
of the unindexed subpath of the query, the higher the benefit of $I_p$ is. We use the difference 
between the length of $p$ and that of the query as the number of joins that the index 
cannot answer. The benefit functions $F_E$ and $F_D$ are shown in Figure 6. 



$F_E(p, p_e, S) = L'(p_e, S)$ 

$F_D(p, p_d, S) = L'(p_d, S) - (L(p_d) - L(p)) = L'(p_d, S) - L(p_d) + L(p)$ 

Fig. 6. $F_E$ and $F_D$ for $I_p$ (Without Statistics) 
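
The heuristic functions of Figure 6 translate almost directly into code. The sketch below is illustrative only: it assumes that a path is a tuple of element names, that the length of a path is its number of steps, and that a subpath is a contiguous run of steps. The function names are hypothetical.

```python
def is_subpath(q, p):
    """True if the tuple q occurs contiguously inside the tuple p."""
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def unindexed_length(p, indexed):
    """L'(p, S): length of p minus the length of its longest indexed subpath."""
    longest = max((len(q) for q in indexed if is_subpath(q, p)), default=0)
    return len(p) - longest


def f_e(p, p_e, indexed):
    """Heuristic benefit of I_p for an equivalent path p_e."""
    return unindexed_length(p_e, indexed)


def f_d(p, p_d, indexed):
    """Heuristic benefit of I_p for a dependent path p_d."""
    return unindexed_length(p_d, indexed) - (len(p_d) - len(p))
```

For example, with only element indices available and $p$ = book/publisher, the equivalent path bib/book/publisher has three steps and a longest indexed subpath of one step, so $F_E$ evaluates to 2. The dependent path bib/book/publisher/name gives $L' = 3$ and $F_D = 3 - (4 - 2) = 1$.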



6 Configuration Enumeration 

After the benefit of each CPI is computed using $F_E$ and $F_D$ in the index benefit algorithm 
(Figure 4), the first two phases of the XIST algorithm (Figure 3) are completed. In the 
third phase, XIST first adds the CPI with the highest benefit to the set of chosen indices 
$S$. Since XIST takes index interaction into account, it needs to recompute the benefits 
of CPIs that have not been chosen. The key idea in efficiently recomputing the benefits 
of CPIs is to recompute only the benefits of the indices on paths that are affected by the 
chosen indices. A naive algorithm would recompute the benefit of each CPI that has not 
been selected. In contrast, XIST employs a more efficient strategy, regardless of whether 
it uses the heuristic-based or the cost-based benefit computation. The 
strategy is briefly described below. 

XIST considers three types of paths that are affected by a selected index on path p: 
(a) subqueries that have not been evaluated and that contain p as a subpath, (b) paths 
that are subpaths of p, and (c) paths that are not subpaths of p but are subpaths of paths 
in (a). 

Using these relationships between selected paths and other unselected paths, we can 
reduce the number of benefit re-computations for the unselected indices. We need to re- 
compute the benefits of unselected indices because the benefits of these indices depend 
on the existence of selected indices. The situation in which the benefit of one index 
depends on the existence of other indices is called index interaction. 

If we did not find such relationships between the selected indices and the unselected 
indices, we could spend an excessive amount of time in computing the benefits of many 
remaining unselected indices. 
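
A greedy enumeration with selective recomputation might look like the following sketch. It is a simplified reading of the strategy, not the XIST code: candidate paths stand in for both indices and subqueries, `benefit` is a hypothetical callable that scores a candidate given the indices chosen so far, and the budget `k` on the number of indices is an assumption of the example.

```python
def is_subpath(q, p):
    """True if the tuple q occurs contiguously inside the tuple p."""
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def affected_paths(p, candidates):
    """Candidates whose benefit may change once an index on p is chosen."""
    others = [c for c in candidates if c != p]
    containing = {c for c in others if is_subpath(p, c)}          # type (a)
    inside = {c for c in others if is_subpath(c, p)}              # type (b)
    related = {c for c in others                                  # type (c)
               if not is_subpath(c, p)
               and any(is_subpath(c, a) for a in containing)}
    return containing | inside | related


def greedy_select(candidates, benefit, k):
    """Pick up to k candidates, rescoring only the affected ones each round."""
    chosen = []
    remaining = set(candidates)
    scores = {c: benefit(c, chosen) for c in remaining}
    while remaining and len(chosen) < k:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda c: scores[c])
        chosen.append(best)
        remaining.discard(best)
        for c in affected_paths(best, remaining):
            scores[c] = benefit(c, chosen)
    return chosen
```

Candidates that share no subpath relationship with the chosen path keep their cached score, which is where the savings over the naive approach come from.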



7 Experimental Evaluation 

In this section, we present the results from an extensive experimental evaluation of XIST, 
and compare it with current index selection techniques. 



7.1 Experimental Setup 

The XIST tool that we implemented is a stand-alone C++ application. It uses the Apache 
Xerces C++ version 2.0 [20] to parse an XML schema. It also implements the selection 
and benefit evaluation of candidate indices, and the configuration enumeration. We then 
used the indices recommended by the XIST toolkit as an input to a native XML DBMS 
that we are currently developing. This system implements stack-based structural join 
algorithms [1]. It uses a B+-tree to implement the value index, and uses a hash indexing 
mechanism to implement the path indices. It evaluates XML queries as follows: if a path 
query matches an indexed pathname, the nodes matching the path are retrieved from 
the path index. If there is no match, the DBMS uses the structural join algorithm [1] to 
join indexed subpaths. Queries on long paths are evaluated using a pipeline of structural 
join operators. The operators are ordered by the estimated cardinality of each join result, 
with the pair resulting in the smallest intermediate result being scheduled first. A query 
with a value-based predicate is executed by evaluating the value predicate first. 
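
To make the join step concrete, the following is a simplified stack-based structural join written in the spirit of the algorithms in [1]. It is a sketch and not the DBMS implementation described here: nodes are assumed to carry the (begin, end, level) region encoding, both input lists are assumed to be sorted by begin, and the `Node` type and function name are chosen for illustration.

```python
from typing import NamedTuple


class Node(NamedTuple):
    begin: int   # position where the element starts
    end: int     # position where the element ends
    level: int   # depth of the element in the document tree


def structural_join(ancestors, descendants, parent_child=False):
    """Pair each descendant with the ancestors that contain it.

    Both lists must be sorted by `begin`. With parent_child=True only
    pairs whose levels differ by exactly one are returned.
    """
    out = []
    stack = []   # chain of nested ancestors that are still open
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i].begin < d.begin:
            a = ancestors[i]
            while stack and stack[-1].end < a.begin:
                stack.pop()          # closed before a starts
            stack.append(a)
            i += 1
        # Discard candidates that closed before d starts.
        while stack and stack[-1].end < d.begin:
            stack.pop()
        # Everything left on the stack contains d.
        for a in stack:
            if not parent_child or a.level + 1 == d.level:
                out.append((a, d))
    return out
```

Each input node is pushed and popped at most once, so the work is linear in the input sizes plus the number of output pairs, which is consistent with the linear cost model of Equation (2).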

In all our experiments, the DBMS was configured to use a 32 MB buffer pool. All 
experiments were performed on a 1.70 GHz Intel Xeon processor, running Debian 
Linux version 2.4.13. 



7.2 Data Sets and Queries 

We used the following four commonly used XML data sets: DBLP [18], Mondial [25], 
Shakespeare Plays [13], and XMark benchmark [22]. For each data set, we generated 
a workload of ten queries. These queries were generated using a query generator which 
takes the set of all distinct paths in the input XML documents as input. The detail of the 
generation method can be found in the full-length version of the paper [21]. 

As an example, using the generation method, some of the queries on the Plays data 
set are as follows: FM/P and /PLAY/ACT/EPILOGUE/SPEECH[SPEAKER="KING"]. 

7.3 Experimental Results 

We now present experimental results that evaluate various aspects of the XIST toolkit. 
First, we demonstrate the effectiveness of the path structure equivalence group (SEG) 
in reducing the number of candidate paths. Then, we compare the performance of 
XIST with the performance of a number of alternative index selection schemes. We also 
performed experiments to assess the impact of the different types of inputs (namely 
query workload, XML schema and statistics) on the behavior of the XIST toolkit. In 
addition, we also analyzed the performance of all the index selection schemes when 
the workload changes over time. However, due to the space limitation, only partial 
experimental results of the impact of different types of inputs are presented here; more 
experimental results are presented in the full-length version of this paper [21]. 

The execution time numbers presented or analyzed in this paper are cold numbers, 
i.e., the queries do not benefit from having any pages cached in the buffer pool from a 
previous run of the system. 

Effectiveness of Structure Equivalence Groups. Paths in a structure equivalence 
group are represented by a single unique path, which is the shortest path pointing to the 
same destination node. Therefore, the number of structure equivalence groups denotes 
the number of such unique paths. 

Figure 7 plots the number of paths and the number of structure equivalence groups for 
all the data sets used in this experimental evaluation. In this figure, DBLP1 and XMark1 
represent those paths from DBLP and XMark with lengths up to five, and DBLP2 and 
XMark2 represent those paths with lengths up to ten. As shown in Figure 7, the number 
of structure equivalence groups is 35%-60% smaller than the total number of paths. 
This result validates our hypothesis that the number of candidate paths can be reduced 
significantly using the XML schema to exploit structural similarities. 



Comparison of Different Indexing Schemes. We compare the performance of the 
following sets of indices: indices on elements (Elem), indices on paths with length up 
to two (SP), indices suggested by XIST (XIST), and indices on the full path query 
definitions (FP). The Elem index selection strategy is interesting because it is a minimal 
set of indices to answer any path query. SP is a set of indices on short paths. XIST is a set 
of indices chosen by the XIST tool given all input information (schema, statistics, 
and query workload). FP is a set of indices that requires no join when evaluating paths 
without value-based predicates. 

All indexing schemes (Elem, SP, XIST, and FP) only build indices on paths without 
value-based predicates. To evaluate paths with value-based predicates, a join operation 
is used between the nodes returned from indices and the nodes that match the value 
predicates. We choose to separate the value indices from the path indices to avoid having 
an excessive number of indices, one for each possible value predicate. In our 
experimental setup, all indexing schemes share the same value indices to evaluate paths 
with value-based predicates. 

Figure 8 shows the performance improvement of XIST over other indices for the four 
experimental data sets. The results shown in Figure 8 illustrate that XIST consistently 
outperforms all other index selection methods for all the data sets. XIST is better than 
Elem and SP because XIST requires fewer joins for evaluating the queries. XIST performs 
better than FP largely because of the use of structure equivalence groups (SEGs) while 
evaluating path queries. In many cases, long path queries without value predicates are 
equivalent to queries on a single element. In such cases, if the size of the element index 
is smaller than the size of the path index, XIST recommends using the element index to 
retrieve the answer. FP, on the other hand, needs to access the larger path index. Another 
reason for the improved performance with XIST is that for some data sets the total size 
of XIST indices (including element indices and path indices) is smaller than that of 
FP indices (including element indices and path indices). The total size of XIST indices 
is smaller because XIST shares a single index among the equivalent paths. Note that the 
equivalent paths cannot be determined when using FP, since FP does not take a schema 
as input; it takes only the query workload as input. Table 1 
presents the sizes of data sets and indices for all data sets. 

Table 1. Sizes of Data Sets and Indices 

Data Set    Size (MB)    Index Size (MB) 
                         Elem    SP     FP     XIST 
DBLP        117          91      117    110    101 
Mondial     2            2       3      3      3 
Plays       8            80      86     85     81 
XMark       11           27      27     27     27 



7.4 Impact of Input Information on XIST 

In this section, we compare the execution times obtained when using indices suggested by XIST 
with those obtained when using indices selected by other index selection strategies. 

Due to space limitations, in this paper we only present the results for DBLP and 
XMark when the workload information is available. DBLP represents a shallow data 
set (short paths), and XMark represents a deep data set (long paths). The experimental 
results for these two data sets are representative of the results for the other two data sets. 

Fig. 10. Performance on XMark 



When the workload information is available, FP and XIST exploit the information to 
build indices that can cover most of the queries. FP indices cover all paths without value- 
based predicates in the workload. Thus, its index selection is close to optimal (without 
any join). When using FP, the only joins that the database needs are the joins between 
the returned nodes from the path index and the nodes that satisfy the value predicates. 
In Figures 9-10, the number of FP indices is used to assign the initial number of indices 
that the XIST tool generates. 

Unlike the heuristic-based benefit function, the cost-based benefit function 
guarantees that the more useful indices are chosen before the less useful indices. The 
execution times of XIST with QW-Stats (QW-Schema-Stats) decrease gradually, in 
contrast to the execution times of XIST with QW (QW-Schema). This is particularly 
noticeable in Figure 10. 

8 Related Work 

In the 1-index [19], data nodes that are bisimilar from the root node are stored in the same 
node in the index graph. The size of the 1-index can be very large compared to the data 
size; thus, the A(k)-index [16] has been proposed to trade off index 
performance against index size. While k-bisimilarity [16] is determined by using XML 
data, the SEGs in this paper are determined by using an XML schema. Recently, the 
D(k)-index [6], which is also based on the concept of bisimilarity, has been proposed as an 
adaptive structural summary. Like XIST, D(k) also takes the query workload as an input. 
However, XIST also takes the XML schema into account, while D(k) does not. Although 
both k-bisimilarity and SEGs group paths that lead to the nodes with the same label, 
SEGs group paths in an XML schema, whereas k-bisimilarity groups paths in XML data. 

Chung et al. have proposed APEX [9], an adaptive path index for XML documents. 
Like APEX, XIST exploits the query workload to find indices that are most likely to 
be useful. However, APEX does not distinguish the benefit of indices on two 
paths with the same frequencies, whereas XIST does. In addition, unlike XIST, APEX 
does not exploit data statistics or the XML schema in index selection. 

Recently, Kaushik et al. have proposed E&B-indexes that use the structural features 
of the input XML documents [15]. E&B indexes are forward-and-backward indices 
for answering branching path queries. Some heuristics for choosing indices, such as 
prioritizing short path indices over long path indices, are proposed [15]. On the other 
hand, XIST takes many additional parameters, such as the information from a schema 
or a query workload. 

Many commercial relational database systems employ index selection features in 
their query optimizers. IBM’s DB2 Advisor [24] recommends candidate indices based 
on the analysis of workload of SQL queries and models the index selection problem 
as a variation of the knapsack problem. The Microsoft SQL Server [3,4] uses simpler 
single-column indices in an iterative manner to recommend multi-column indices. XIST 
groups a set of paths (a set of multiple-columns) that can share the index. 

Our work is closest to the index selection schemes proposed by Chawathe et al. [5] for 
object-oriented databases. Both the index selection schemes [5] and XIST find the index 
interaction through the relationships between subpath indices and queries. However, 
XIST exploits the structural information to reduce the number of candidate indices 
and optimize the query processing of XML queries while [5] only looks at the query 
workload to choose candidate indices for evaluating object-oriented queries. 



9 Conclusions 

In this paper, we have described XIST, an XML index selection tool, which recommends 
a set of path indices given a combination of a query workload, a schema, and data 
statistics. By exploiting structural summaries from schema descriptions, the number of 
candidate indices can be substantially reduced for most XML data sets and workloads. 
XIST incorporates a robust benefit analysis technique using cost models or a simplified 
heuristic. It also models the ability of an index to effectively evaluate sub-paths of a 
path expression. Our experimental evaluation demonstrates that the indices selected by 
XIST perform better than those selected by existing methods. In our experimental evaluation, 
we had to tailor the cost model used in XIST to accurately model the techniques that are 
implemented in our native XML system. However, we believe that the general framework 
of XIST, with its use of structure equivalence groups and efficient benefit recomputation 
methods, can be adapted for use with other DBMSs with different implementations and 
query evaluation algorithms. To adapt this general framework to other systems, accurate 
cost model equations that account for the system-specific details are required. Within the 
scope of this paper, we have chosen to focus on the general framework and algorithms of 
an XML index selection tool, and have demonstrated its effectiveness for our native 
XML DBMS. 

In the future, we plan on extending XIST to include additional types of path indices, 
such as indices on regular path expressions and on twig queries. 